Accelerating Speculative Decoding with Block Diffusion Draft Trees
Pith reviewed 2026-05-10 16:00 UTC · model grok-4.3
The pith
DDTree constructs a draft tree from a block diffusion drafter's distributions to verify multiple trajectories in one target pass.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, the resulting gains in acceptance length place DDTree among the leading approaches to speculative decoding.
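The abstract gives no pseudocode for the best-first expansion. Below is a minimal sketch of how such a construction could look, assuming the block diffusion drafter exposes per-position candidate probabilities for the draft block and that a path is scored by the product of its per-position draft probabilities; the interface names `draft_probs`, `node_budget`, and `top_k` are illustrative, not taken from the paper.

```python
import heapq
import math

def build_draft_tree(draft_probs, node_budget, top_k=4):
    """Best-first expansion of a draft tree under a fixed node budget.

    draft_probs: list of dicts, one per draft position; draft_probs[d] maps a
        candidate token to its probability under the block diffusion drafter.
        (Illustrative interface; the paper's drafter may expose this differently.)
    Returns a list of nodes (token, parent_index, depth); node 0 is the root.
    """
    nodes = [(None, -1, -1)]          # root: no token, hangs off the committed prefix
    # Heap entries: (-log path score, tie-breaker, parent index, depth, token).
    heap, tie = [], 0
    for tok, p in sorted(draft_probs[0].items(), key=lambda kv: -kv[1])[:top_k]:
        heapq.heappush(heap, (-math.log(max(p, 1e-12)), tie, 0, 0, tok))
        tie += 1

    while heap and len(nodes) - 1 < node_budget:
        neg_logp, _, parent, depth, tok = heapq.heappop(heap)
        nodes.append((tok, parent, depth))
        node_id = len(nodes) - 1
        # Expand children at the next draft position, scored by the surrogate
        # (product of per-position draft probabilities along the path).
        if depth + 1 < len(draft_probs):
            for child_tok, p in sorted(draft_probs[depth + 1].items(),
                                       key=lambda kv: -kv[1])[:top_k]:
                heapq.heappush(heap, (neg_logp - math.log(max(p, 1e-12)),
                                      tie, node_id, depth + 1, child_tok))
                tie += 1
    return nodes
```

Because a block diffusion drafter emits all positions of a block in one pass, the sketch reuses the same per-position distribution for every branch at a given depth; whether DDTree instead re-conditions candidates on the chosen parent is not stated in the abstract.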
What carries the argument
The draft tree built by a best-first heap algorithm from the block diffusion drafter's per-position distributions, verified in parallel with an ancestor-only attention mask.
If this is right
- Longer average accepted sequences per verification round compared to single-trajectory methods.
- Improved overall speedup in speculative decoding without modifying the underlying drafter.
- Ability to leverage the block diffusion model's full output distributions for better branch selection.
- Placement among top-performing speculative decoding techniques due to building on DFlash.
Where Pith is reading between the lines
- The method might extend to other types of non-autoregressive drafters beyond block diffusion.
- Tree verification could be combined with other acceleration techniques like quantization or pruning.
- Further gains may come from optimizing the heap selection or mask for specific model architectures.
Load-bearing premise
That the draft model's per-position distributions serve as a good surrogate for ranking which continuations the target model will accept, and that the ancestor-only attention mask enables accurate parallel verification without hidden dependencies.
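The second half of this premise is mechanically checkable, because the mask itself is simple to state. A minimal sketch of an ancestor-only mask built from parent pointers follows; the node layout and the treatment of the committed prefix are assumptions of the sketch, not details taken from the paper.

```python
import numpy as np

def ancestor_only_mask(parents):
    """Boolean attention mask for parallel tree verification.

    parents[i] is the parent index of node i (-1 for nodes hanging directly off
    the committed prefix). mask[i, j] is True iff node i may attend to node j,
    i.e. j is i itself or one of i's ancestors in the draft tree.
    """
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parents[j]
    return mask

# Example: a root-level node with two children, one of which has its own child.
# parents = [-1, 0, 0, 1] gives node 3 access to nodes 3, 1, and 0 only.
print(ancestor_only_mask([-1, 0, 0, 1]).astype(int))
```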
What would settle it
A direct comparison experiment where the acceptance length and accuracy of DDTree-verified tokens is measured against sequential verification of the same branches; if the tree method shows lower acceptance or incorrect tokens due to the mask, the claim fails.
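Running that experiment requires a way to read an acceptance length off a verified tree. One natural definition, sketched below under the assumption that the target's verification rule yields a boolean accept flag per node (the rule itself is not described in this summary), is the depth of the deepest node whose entire root path was accepted.

```python
def accepted_length(parents, accepted):
    """Length of the longest root path in which every node was accepted.

    parents[i]: parent index of node i (-1 for children of the committed prefix).
    accepted[i]: True if the target model accepted the token at node i.
    """
    best = 0
    for i in range(len(parents)):
        # Walk up to the root, counting nodes, and bail out on any rejection.
        length, j, ok = 0, i, True
        while j != -1:
            if not accepted[j]:
                ok = False
                break
            length += 1
            j = parents[j]
        if ok:
            best = max(best, length)
    return best

# Two accepted branches of depth 2 and 1: the accepted length is 2.
print(accepted_length([-1, 0, -1], [True, True, True]))  # -> 2
```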
original abstract
Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these gains place DDTree among the leading approaches to speculative decoding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DDTree (Diffusion Draft Tree) as an extension to the block-diffusion drafter DFlash for speculative decoding. It constructs a draft tree from the drafter's per-position output distributions by applying a best-first heap selection under a fixed node budget, using draft probabilities as a surrogate for target acceptance. The resulting tree is then verified in a single target-model forward pass via an ancestor-only attention mask. The central claim is that this yields longer expected acceptance lengths than vanilla single-trajectory DFlash while remaining computationally efficient, thereby placing DDTree among the leading speculative-decoding methods.
Significance. If the empirical results and correctness arguments hold, the work would be a meaningful incremental advance in speculative decoding: it shows how to convert a strong block-diffusion drafter into a tree-structured proposer without extra target passes, potentially increasing throughput under the same node budget. The approach is conceptually clean and directly leverages an existing high-performing drafter.
major comments (3)
- [Abstract] Abstract: the claim that DDTree 'achieves state-of-the-art speculative decoding performance' and 'outperforms EAGLE-3' is unsupported by any quantitative results, acceptance-length tables, speed-up numbers, or experimental protocol. Without these data the central claim cannot be evaluated.
- [Method] Method description (implicit in abstract and §3): the paper asserts that the best-first selection using draft per-position probabilities as surrogate produces trees with higher expected acceptance length than DFlash's single trajectory, yet supplies no correlation analysis, logit-matching checks, or ablation showing that the surrogate ranking reliably predicts target acceptance decisions despite model mismatch.
- [Verification] Verification procedure (ancestor-only attention mask): the manuscript states that the mask 'permits correct parallel verification' in one forward pass, but provides no explicit validation (e.g., output-distribution equivalence test or hidden-state leakage check) that the mask strictly restricts attention to ancestors without cross-branch leakage or alteration of the target's output distribution relative to sequential verification.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the presentation of results, methodological justifications, and correctness arguments.
point-by-point responses
- Referee: [Abstract] Abstract: the claim that DDTree 'achieves state-of-the-art speculative decoding performance' and 'outperforms EAGLE-3' is unsupported by any quantitative results, acceptance-length tables, speed-up numbers, or experimental protocol. Without these data the central claim cannot be evaluated.
Authors: We agree that the abstract would be clearer with supporting quantitative highlights. The full manuscript (Section 4) contains the relevant tables and experimental protocol showing acceptance lengths, throughput, and direct comparisons to EAGLE-3 and other baselines. We will revise the abstract to include key numbers (e.g., average accepted tokens and speedup factors) and a brief reference to the evaluation setup. revision: yes
- Referee: [Method] Method description (implicit in abstract and §3): the paper asserts that the best-first selection using draft per-position probabilities as surrogate produces trees with higher expected acceptance length than DFlash's single trajectory, yet supplies no correlation analysis, logit-matching checks, or ablation showing that the surrogate ranking reliably predicts target acceptance decisions despite model mismatch.
Authors: The surrogate ranking is motivated by the established correlation between drafter probabilities and target acceptance in the speculative decoding literature. While the initial submission did not contain an explicit correlation or ablation study, we will add one in the revision: an analysis on held-out data measuring the correlation between draft probabilities and actual acceptance, plus a direct comparison of expected acceptance length for the best-first tree versus DFlash's single trajectory under the same node budget. revision: yes
- Referee: [Verification] Verification procedure (ancestor-only attention mask): the manuscript states that the mask 'permits correct parallel verification' in one forward pass, but provides no explicit validation (e.g., output-distribution equivalence test or hidden-state leakage check) that the mask strictly restricts attention to ancestors without cross-branch leakage or alteration of the target's output distribution relative to sequential verification.
Authors: We acknowledge that an explicit empirical or analytical validation of the mask would strengthen the correctness argument. In the revised manuscript we will add a short validation subsection that reports an output-distribution equivalence test (comparing logits from the masked parallel pass against sequential verification on sample sequences) to confirm absence of cross-branch leakage. revision: yes
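The correlation analysis promised in the second response could be as simple as a sample correlation between the drafter's per-token probabilities and the binary accept decisions logged during verification; a minimal sketch under that assumption (the logging interface is hypothetical, not from the paper):

```python
import numpy as np

def draft_prob_acceptance_correlation(draft_probs, accepted):
    """Point-biserial correlation between drafter probability and acceptance.

    draft_probs: array of the drafter's probability for each proposed token.
    accepted: boolean array, True where the target model accepted that token.
    Both arrays are assumed to come from logged verification rounds.
    """
    p = np.asarray(draft_probs, dtype=float)
    a = np.asarray(accepted, dtype=float)
    return float(np.corrcoef(p, a)[0, 1])
```

The equivalence test promised in the third response only needs to compare target logits from the masked parallel pass against logits from verifying each root-to-leaf branch as an ordinary sequence. A minimal sketch, assuming hypothetical helpers `target_logits(tokens)` and `target_logits_tree(tokens, mask)` that both already condition on the committed prefix:

```python
import numpy as np

def mask_equivalence_gap(tree_tokens, parents, target_logits,
                         target_logits_tree, ancestor_mask):
    """Max absolute logit gap between masked tree verification and sequential
    verification of each branch; near zero (up to numerics) if the
    ancestor-only mask introduces no cross-branch leakage."""
    tree_out = target_logits_tree(tree_tokens, ancestor_mask)  # [n_nodes, vocab]
    worst = 0.0
    leaves = set(range(len(parents))) - set(p for p in parents if p != -1)
    for leaf in leaves:
        # Recover the root-to-leaf branch as a list of node indices.
        branch, j = [], leaf
        while j != -1:
            branch.append(j)
            j = parents[j]
        branch.reverse()
        seq_out = target_logits([tree_tokens[i] for i in branch])  # [len, vocab]
        for pos, node in enumerate(branch):
            worst = max(worst,
                        float(np.max(np.abs(seq_out[pos] - tree_out[node]))))
    return worst
```

A gap at floating-point noise levels would support the mask-correctness assumption listed in the ledger below; anything larger would indicate cross-branch leakage.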
Circularity Check
Minor self-citation to prior DFlash work that is not load-bearing for the new algorithmic claims
full rationale
The paper presents DDTree as an algorithmic extension of the DFlash block diffusion drafter: it applies a best-first heap to select tree nodes from the drafter's per-position output distributions and uses an ancestor-only attention mask for parallel verification in one target forward pass. No equations, fitted parameters, or self-definitional reductions are shown that would make the claimed acceptance-length gains equivalent to the inputs by construction. The reference to DFlash as a 'leading draft model' is a self-citation, but the central derivation (tree construction and mask) introduces independent content and does not rely on the citation to forbid alternatives or force the result. The derivation chain is self-contained, and its claims can be checked against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- node budget
axioms (1)
- domain assumption: the ancestor-only attention mask permits correct parallel verification of all tree paths in a single forward pass.
invented entities (1)
- DDTree (Diffusion Draft Tree): no independent evidence
Reference graph
Works this paper leans on
- [1] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning. PMLR, 2023.
- [2] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.
- [3] Marianne Arriola, Subham Sekhar Sahoo, Aaron Gokaslan, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Justin T Chiu, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. In International Conference on Learning Representations, 2025.
- [4] Jian Chen, Yesheng Liang, and Zhijian Liu. DFlash: Block diffusion for flash speculative decoding. arXiv preprint arXiv:2602.06036, 2026.
- [5] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-3: Scaling up inference acceleration of large language models via training-time test. In Conference on Neural Information Processing Systems, 2025.
- [6] Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, and Min Zhang. Opt-tree: Speculative decoding with adaptive draft tree structure. Transactions of the Association for Computational Linguistics, 13:188–199, 2025.
- [7] Benjamin Frederick Spector and Christopher Re. Accelerating LLM inference with staged speculative decoding. In Workshop on Efficient Systems for Foundation Models, ICML, 2023.
- [8] Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, and Christopher Lott. Recursive speculative decoding: Accelerating LLM inference via sampling without replacement. In Workshop on Large Language Model (LLM) Agents, ICLR, 2024.
- [9] Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, and Lei Zou. Dyspec: Faster speculative decoding with dynamic token tree structure. World Wide Web, 28(3):36, 2025.
- [10] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. SpecInfer: Accelerating large language model serving with tree-based speculative inference and verification. ACM International Conference on ..., 2023.
- [11] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. In International Conference on Machine Learning, 2024.
- [12] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE: Speculative sampling requires rethinking feature uncertainty. In International Conference on Machine Learning, 2024.
- [13] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-2: Faster inference of language models with dynamic draft trees. In Conference on Empirical Methods in Natural Language Processing, pages 7421–7432, 2024.
- [14] Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, and Emad Barsoum. PARD: Accelerating LLM inference with low-cost PARallel draft model adaptation. In International Conference on Learning Representations, 2026.
- [15] Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, and Chen Tian. Dart: Diffusion-inspired speculative decoding for fast LLM inference. arXiv preprint arXiv:2601.19278, 2026.
- [16] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyan Lin, Kai Dang, Keqin Bao, Ke-Pei... arXiv preprint, 2025.
- [17] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In International Conference on Learning Representations, 2023.
- [18] Karl Cobbe, Vineet Kosaraju, Mo Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [19] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mo Bavarian, Clemens Winter, Phi... Evaluating large language models trained on code. arXiv preprint, 2021.
- [20] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- [21] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In International Conference on Learning Representations, 2025.
- [22] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024.
- [23] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 2023.
- [24] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models, 3(6):7, 2023.
- [25] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations, 2024.