pith. machine review for the scientific record.

arxiv: 2604.12989 · v1 · submitted 2026-04-14 · 💻 cs.CL

Recognition: unknown

Accelerating Speculative Decoding with Block Diffusion Draft Trees

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 16:00 UTC · model grok-4.3

classification 💻 cs.CL
keywords speculative decoding · block diffusion · draft tree · diffusion draft tree · DDTree · language model inference · autoregressive decoding

The pith

DDTree constructs a draft tree from a block diffusion drafter's distributions to verify multiple trajectories in one target pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to extend block diffusion drafting for speculative decoding by building a tree of possible token sequences instead of a single chain. It selects the most promising branches with a best-first algorithm driven by the drafter's own probability outputs, under a fixed node budget. The branches are then checked together by the target model in a single forward pass, using an attention mask that lets each drafted token attend only to its ancestors. The aim is to accept more tokens per verification round than standard single-trajectory drafting while keeping the drafter fast. Readers interested in efficient LLM inference would care because this could reduce the number of expensive target-model calls needed to generate text.
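As a rough illustration of the selection step, here is a minimal sketch of best-first tree growth under a node budget. It assumes the drafter exposes top-k tokens and probabilities per draft position (`draft_topk`) and uses cumulative log-probability as the priority; both are assumptions for illustration, not the paper's exact interface or surrogate.

```python
import heapq
import math
from dataclasses import dataclass

@dataclass
class DraftNode:
    token: int
    score: float   # cumulative draft log-probability of the path (illustrative surrogate)
    depth: int
    parent: int    # index of the parent in the nodes list; -1 marks the virtual root

def build_draft_tree(draft_topk, node_budget, fanout=4):
    """Best-first growth of a draft tree under a fixed node budget.

    draft_topk[d] is a list of (token, prob) pairs the drafter assigns to draft
    position d. Candidate children live in a max-heap keyed by path score; the
    globally best candidate is materialized as a tree node, then its own
    children are pushed as new candidates.
    """
    nodes = [DraftNode(token=-1, score=0.0, depth=-1, parent=-1)]  # virtual root
    heap, tie = [], 0  # heap entries: (-score, tie_breaker, token, depth, parent_index)

    def push_children(parent_idx):
        nonlocal tie
        depth = nodes[parent_idx].depth + 1
        if depth >= len(draft_topk):
            return
        for token, prob in draft_topk[depth][:fanout]:
            score = nodes[parent_idx].score + math.log(max(prob, 1e-12))
            heapq.heappush(heap, (-score, tie, token, depth, parent_idx))
            tie += 1

    push_children(0)
    while heap and len(nodes) - 1 < node_budget:
        neg_score, _, token, depth, parent_idx = heapq.heappop(heap)
        nodes.append(DraftNode(token=token, score=-neg_score, depth=depth, parent=parent_idx))
        push_children(len(nodes) - 1)
    return nodes  # nodes[1:] are the budgeted draft tokens, each carrying a parent pointer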

Core claim

We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these gains place DDTree among the leading approaches to speculative decoding.

What carries the argument

The draft tree built by a best-first heap algorithm from the block diffusion drafter's per-position distributions, verified in parallel with an ancestor-only attention mask.
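For the single verification pass, the tree is flattened behind the committed prefix and each drafted token may attend only to the prefix and its own ancestors. A minimal sketch of such a mask, assuming parent pointers like those produced by the selection sketch above (`True` means attention is allowed); the helper name and layout are illustrative, not the paper's implementation.

```python
import torch

def ancestor_only_mask(parents, prefix_len):
    """Build a boolean attention mask for parallel tree verification.

    parents[i] is the index (within the tree) of node i's parent, or -1 for
    nodes hanging directly off the committed prefix. Position prefix_len + i
    may attend to the whole prefix, to its ancestors, and to itself, so each
    root-to-node path sees exactly the context it would see if verified alone.
    """
    n = len(parents)
    total = prefix_len + n
    mask = torch.zeros(total, total, dtype=torch.bool)
    # the committed prefix keeps ordinary causal attention
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len, dtype=torch.bool))
    for i in range(n):
        row = prefix_len + i
        mask[row, :prefix_len] = True        # every draft token sees the prefix
        j = i
        while j != -1:                       # walk up the tree: self + ancestors
            mask[row, prefix_len + j] = True
            j = parents[j]
    return mask
```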

If this is right

  • Longer average accepted sequences per verification round compared to single-trajectory methods.
  • Improved overall speedup in speculative decoding without modifying the underlying drafter.
  • Ability to leverage the block diffusion model's full output distributions for better branch selection.
  • Placement among top-performing speculative decoding techniques due to building on DFlash.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method might extend to other types of non-autoregressive drafters beyond block diffusion.
  • Tree verification could be combined with other acceleration techniques like quantization or pruning.
  • Further gains may come from optimizing the heap selection or mask for specific model architectures.

Load-bearing premise

That the draft model's per-position distributions serve as a good surrogate for ranking which continuations the target model will accept, and that the ancestor-only attention mask enables accurate parallel verification without hidden dependencies.
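One plausible reading of that surrogate, stated here as an assumption rather than the paper's definition, is the cumulative draft probability of a path, which is also the priority used in the selection sketch above:

```latex
% Illustrative surrogate (assumption): rank a path y_{1:d} by its cumulative
% probability under the draft distributions q_t, treating positions within the
% block as independent.
\[
  s(y_{1:d}) \;=\; \prod_{t=1}^{d} q_t(y_t),
  \qquad
  \log s(y_{1:d}) \;=\; \sum_{t=1}^{d} \log q_t(y_t).
\]
```

The premise is then that paths with high s under the drafter are also the paths the target model is most likely to accept.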

What would settle it

A direct comparison experiment in which the acceptance length and accuracy of DDTree-verified tokens are measured against sequential verification of the same branches; if the tree method shows lower acceptance or produces incorrect tokens because of the mask, the claim fails.
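A minimal harness for that experiment might look like the following, with `verify_tree`, `verify_branch`, and `tree.branches()` as hypothetical stand-ins for the paper's procedures rather than its actual API.

```python
def compare_acceptance(verify_tree, verify_branch, prompts_with_trees):
    """Compare masked tree verification against sequential verification of the same branches.

    verify_tree(prompt, tree)   -> tokens accepted from one ancestor-masked target pass.
    verify_branch(prompt, path) -> tokens accepted when that single branch is verified alone.
    Both callables and tree.branches() are illustrative stand-ins.
    """
    avg_tree, avg_seq, disagreements = 0.0, 0.0, 0
    for prompt, tree in prompts_with_trees:
        tree_accept = verify_tree(prompt, tree)
        best_branch = max((verify_branch(prompt, b) for b in tree.branches()), key=len)
        avg_tree += len(tree_accept)
        avg_seq += len(best_branch)
        disagreements += int(list(tree_accept) != list(best_branch))
    n = len(prompts_with_trees)
    # The claim fails if the masked pass accepts fewer tokens, or returns tokens
    # that sequential verification of the same branches would not have produced.
    return avg_tree / n, avg_seq / n, disagreements
```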

Figures

Figures reproduced from arXiv: 2604.12989 by Liran Ringel, Yaniv Romano.

Figure 1: Speedups relative to autoregressive decoding at temperature 0.0 across datasets and target …
Figure 2: Illustration of one DDTree decoding round. The bonus token …
Figure 3: Budget tradeoff on MATH-500 with Qwen3-8B at temperature 0.0. Acceptance length …
Figure 4: Acceptance length distribution on MATH-500 with Qwen3-8B at temperature 0.0. The …
read the original abstract

Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these gains place DDTree among the leading approaches to speculative decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper proposes DDTree (Diffusion Draft Tree) as an extension to the block-diffusion drafter DFlash for speculative decoding. It constructs a draft tree from the drafter's per-position output distributions by applying a best-first heap selection under a fixed node budget, using draft probabilities as a surrogate for target acceptance. The resulting tree is then verified in a single target-model forward pass via an ancestor-only attention mask. The central claim is that this yields longer expected acceptance lengths than vanilla single-trajectory DFlash while remaining computationally efficient, thereby placing DDTree among the leading speculative-decoding methods.
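For context on what verification means here: in the simplest form, each node on a retained path is checked with the standard speculative-sampling acceptance test against the target distribution at that position; whether DDTree uses exactly this rule or a tree-adapted variant is not stated in the abstract. A generic sketch of the standard test (following Leviathan et al., reference [1]), not code from this paper:

```python
import torch

def accept_token(p, q, y, generator=None):
    """Standard speculative-sampling acceptance test.

    p: target-model probabilities over the vocabulary at this position.
    q: draft-model probabilities at the same position.
    y: the drafted token id. Returns (accepted, token): the drafted token if
    accepted, otherwise a replacement sampled from the normalized residual
    max(p - q, 0), which keeps the overall output distribution equal to p.
    """
    u = torch.rand((), generator=generator)
    if u < torch.clamp(p[y] / q[y], max=1.0):
        return True, y
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    return False, torch.multinomial(residual, 1).item()
```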

Significance. If the empirical results and correctness arguments hold, the work would be a meaningful incremental advance in speculative decoding: it shows how to convert a strong block-diffusion drafter into a tree-structured proposer without extra target passes, potentially increasing throughput under the same node budget. The approach is conceptually clean and directly leverages an existing high-performing drafter.

major comments (3)
  1. [Abstract] Abstract: the claim that DDTree 'achieves state-of-the-art speculative decoding performance' and 'outperforms EAGLE-3' is unsupported by any quantitative results, acceptance-length tables, speed-up numbers, or experimental protocol. Without these data the central claim cannot be evaluated.
  2. [Method] Method description (implicit in abstract and §3): the paper asserts that the best-first selection using draft per-position probabilities as surrogate produces trees with higher expected acceptance length than DFlash's single trajectory, yet supplies no correlation analysis, logit-matching checks, or ablation showing that the surrogate ranking reliably predicts target acceptance decisions despite model mismatch.
  3. [Verification] Verification procedure (ancestor-only attention mask): the manuscript states that the mask 'permits correct parallel verification' in one forward pass, but provides no explicit validation (e.g., output-distribution equivalence test or hidden-state leakage check) that the mask strictly restricts attention to ancestors without cross-branch leakage or alteration of the target's output distribution relative to sequential verification.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the presentation of results, methodological justifications, and correctness arguments.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that DDTree 'achieves state-of-the-art speculative decoding performance' and 'outperforms EAGLE-3' is unsupported by any quantitative results, acceptance-length tables, speed-up numbers, or experimental protocol. Without these data the central claim cannot be evaluated.

    Authors: We agree that the abstract would be clearer with supporting quantitative highlights. The full manuscript (Section 4) contains the relevant tables and experimental protocol showing acceptance lengths, throughput, and direct comparisons to EAGLE-3 and other baselines. We will revise the abstract to include key numbers (e.g., average accepted tokens and speedup factors) and a brief reference to the evaluation setup. revision: yes

  2. Referee: [Method] Method description (implicit in abstract and §3): the paper asserts that the best-first selection using draft per-position probabilities as surrogate produces trees with higher expected acceptance length than DFlash's single trajectory, yet supplies no correlation analysis, logit-matching checks, or ablation showing that the surrogate ranking reliably predicts target acceptance decisions despite model mismatch.

    Authors: The surrogate ranking is motivated by the established correlation between drafter probabilities and target acceptance in speculative decoding literature. While the initial submission did not contain an explicit correlation or ablation study, we will add one in the revision: an analysis on held-out data measuring the correlation between draft probabilities and actual acceptance, plus a direct comparison of expected acceptance length for the best-first tree versus DFlash's single trajectory under the same node budget. revision: yes

  3. Referee: [Verification] Verification procedure (ancestor-only attention mask): the manuscript states that the mask 'permits correct parallel verification' in one forward pass, but provides no explicit validation (e.g., output-distribution equivalence test or hidden-state leakage check) that the mask strictly restricts attention to ancestors without cross-branch leakage or alteration of the target's output distribution relative to sequential verification.

    Authors: We acknowledge that an explicit empirical or analytical validation of the mask would strengthen the correctness argument. In the revised manuscript we will add a short validation subsection that reports an output-distribution equivalence test (comparing logits from the masked parallel pass against sequential verification on sample sequences) to confirm absence of cross-branch leakage. revision: yes
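A sketch of what such an equivalence test could look like, assuming a model wrapper `forward(tokens, mask)` that accepts an explicit boolean attention mask and reusing the `ancestor_only_mask` helper sketched earlier; all names are illustrative, and whether a given model wrapper exposes such a mask interface is implementation-dependent.

```python
import torch

def check_mask_equivalence(forward, prefix, tree_tokens, parents, atol=1e-4):
    """Compare masked parallel verification against per-branch sequential passes.

    forward(tokens, mask) -> logits of shape (len(tokens), vocab_size); assumed
    to honor an explicit boolean attention mask. For every tree node, the
    logits it receives under the ancestor-only mask should match the logits
    obtained by running its root-to-node path alone under a causal mask.
    """
    mask = ancestor_only_mask(parents, len(prefix))        # from the earlier sketch
    par_logits = forward(prefix + tree_tokens, mask)
    for i in range(len(tree_tokens)):
        path, j = [], i
        while j != -1:                                     # recover the branch tokens
            path.append(tree_tokens[j])
            j = parents[j]
        path = prefix + path[::-1]
        causal = torch.tril(torch.ones(len(path), len(path), dtype=torch.bool))
        seq_logits = forward(path, causal)
        if not torch.allclose(par_logits[len(prefix) + i], seq_logits[-1], atol=atol):
            return False                                   # cross-branch leakage or mask bug
    return True
```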

Circularity Check

0 steps flagged

Minor self-citation to prior DFlash work that is not load-bearing for the new algorithmic claims

full rationale

The paper presents DDTree as an algorithmic extension of the DFlash block diffusion drafter: it applies a best-first heap to select tree nodes from the drafter's per-position output distributions and uses an ancestor-only attention mask for parallel verification in one target forward pass. No equations, fitted parameters, or self-definitional reductions are shown that would make the claimed acceptance-length gains equivalent to the inputs by construction. The reference to DFlash as a 'leading draft model' is a self-citation, but the central derivation (tree construction and mask) introduces independent content and does not rely on the citation to forbid alternatives or force the result. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the unstated assumption that the drafter's output distributions form a useful surrogate for target-model acceptance and that the ancestor-only mask correctly implements tree verification; the only free parameter is the node budget, and the only invented entity is the DDTree construct itself.

free parameters (1)
  • node budget
    Fixed budget on the number of nodes in the draft tree; its value is chosen but not specified in the abstract.
axioms (1)
  • domain assumption: the ancestor-only attention mask permits correct parallel verification of all tree paths in a single forward pass.
    Invoked to justify efficient verification; no justification or reference supplied in abstract.
invented entities (1)
  • DDTree (Diffusion Draft Tree) · no independent evidence
    purpose: Data structure that organizes block-diffusion draft tokens into a tree for multi-trajectory verification.
    New algorithmic construct introduced by the paper.

pith-pipeline@v0.9.0 · 5480 in / 1358 out tokens · 36792 ms · 2026-05-10T16:00:27.107658+00:00 · methodology


Reference graph

Works this paper leans on

25 extracted references · 7 canonical work pages · 5 internal anchors

  1. [1]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning. PMLR, 2023

  2. [2]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023

  3. [3]

    Block diffusion: Interpolating between autoregressive and diffusion language models

    Marianne Arriola, Subham Sekhar Sahoo, Aaron Gokaslan, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Justin T Chiu, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. In International Conference on Learning Representations, 2025

  4. [4]

    DFlash: Block diffusion for flash speculative decoding

    Jian Chen, Yesheng Liang, and Zhijian Liu. DFlash: Block diffusion for flash speculative decoding. arXiv preprint arXiv:2602.06036, 2026

  5. [5]

    EAGLE-3: Scaling up inference acceleration of large language models via training-time test

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-3: Scaling up inference acceleration of large language models via training-time test. In Conference on Neural Information Processing Systems, 2025

  6. [6]

    Opt-tree: Speculative decoding with adaptive draft tree structure

    Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, and Min Zhang. Opt-tree: Speculative decoding with adaptive draft tree structure. Transactions of the Association for Computational Linguistics, 13:188–199, 2025

  7. [7]

    Accelerating LLM inference with staged speculative decoding

    Benjamin Frederick Spector and Christopher Re. Accelerating LLM inference with staged speculative decoding. In Workshop on Efficient Systems for Foundation Models, ICML, 2023

  8. [8]

    Recursive speculative decoding: Accelerating LLM inference via sampling without replacement

    Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, and Christopher Lott. Recursive speculative decoding: Accelerating LLM inference via sampling without replacement. In Workshop on Large Language Model (LLM) Agents, ICLR, 2024

  9. [9]

    Dyspec: Faster speculative decoding with dynamic token tree structure

    Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, and Lei Zou. Dyspec: Faster speculative decoding with dynamic token tree structure. World Wide Web, 28(3):36, 2025

  10. [10]

    Specinfer: Accelerating large language model serving with tree-based speculative inference and verification

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. ACM International Conference on ...

  11. [11]

    Medusa: Simple LLM inference acceleration framework with multiple decoding heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. In International Conference on Machine Learning, 2024

  12. [12]

    EAGLE: Speculative sampling requires rethinking feature uncertainty

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE: Speculative sampling requires rethinking feature uncertainty. In International Conference on Machine Learning, 2024

  13. [13]

    Eagle-2: Faster inference of language models with dynamic draft trees

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees. In Conference on Empirical Methods in Natural Language Processing, pages 7421–7432, 2024

  14. [14]

    PARD: Accelerating LLM inference with low-cost PARallel draft model adaptation

    Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, and Emad Barsoum. PARD: Accelerating LLM inference with low-cost PARallel draft model adaptation. In International Conference on Learning Representations, 2026

  15. [15]

    Dart: Diffusion-inspired speculative decoding for fast LLM inference

    Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, and Chen Tian. Dart: Diffusion-inspired speculative decoding for fast LLM inference. arXiv preprint arXiv:2601.19278, 2026

  16. [16]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyan Lin, Kai Dang, Keqin Bao, Ke-Pei...

  17. [17]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, 2023

  18. [18]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mo Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  19. [19]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mo Bavarian, Clemens Winter, Phi...

  20. [20]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  21. [21]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In International Conference on Learning Representations, 2025

  22. [22]

    SWE-bench: Can language models resolve real-world GitHub issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024

  23. [23]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 2023

  24. [24]

    Alpaca: A strong, replicable instruction-following model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models, 3(6):7, 2023

  25. [25]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations, 2024