D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models

Binhang Yuan; Jiahao Zhang; Jiarui Zhang; Jinwei Yao; Liyuan Zhang; Ran Yan; Tongkai Yang; Yi Wu; Yuchen Yang

arxiv: 2606.04446 · v1 · pith:HYIWB57Pnew · submitted 2026-06-03 · 💻 cs.DC · cs.LG

Liyuan Zhang, Jiarui Zhang, Jinwei Yao, Ran Yan, Yuchen Yang, Jiahao Zhang, Tongkai Yang, Yi Wu, Binhang Yuan

Pith reviewed 2026-06-28 04:48 UTC · model grok-4.3

classification 💻 cs.DC cs.LG

keywords speculative decoding · diffusion models · large language models · inference acceleration · prefix tree · cascade attention · draft model

The pith

Dual diffusion drafters with a confidence-guided prefix tree raise the number of accepted tokens per verification step in speculative decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the problem that a single diffusion-generated draft sequence is discarded entirely after the first mismatch, limiting acceptance rates in speculative decoding. It proposes a framework that runs a first diffusion drafter to produce tokens plus per-position confidence scores, uses those scores to locate the probable rejection boundary, and selects top-K prefixes for recovery. A second variable-prefix diffusion drafter then generates alternative continuations from those prefixes in one batched call. The resulting set of shared-prefix candidates is verified together with cascade attention, producing more accepted tokens without a proportional increase in drafting or verification cost.
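
The prefix-selection step can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes each confidence score can be read as an independent per-position acceptance probability, so the chance that the first rejection lands at position i is the product of the earlier scores times (1 - score at i). The function names `first_rejection_probs` and `select_prefixes` and the example scores are hypothetical.

```python
from typing import List


def first_rejection_probs(conf: List[float]) -> List[float]:
    """P(first rejection at position i), ASSUMING independent per-position acceptance."""
    probs = []
    survive = 1.0  # probability that every earlier position was accepted
    for c in conf:
        probs.append(survive * (1.0 - c))
        survive *= c
    return probs


def select_prefixes(conf: List[float], k: int) -> List[int]:
    """Return the k prefix lengths after which a rejection is most likely.

    A prefix length i means: keep draft tokens 0..i-1 and re-draft from position i.
    """
    probs = first_rejection_probs(conf)
    ranked = sorted(range(len(conf)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical confidence scores for a 5-token draft block.
conf = [0.9, 0.8, 0.5, 0.9, 0.4]
print(select_prefixes(conf, k=2))  # -> [2, 4]
```

Under this assumption the second drafter would be re-anchored after 2 and after 4 kept tokens, the two points where the first draft is most likely to break.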

Core claim

D^2SD organizes candidates into a confidence-guided prefix tree: the first diffusion drafter generates a block along with per-position confidence scores to identify the most likely rejection boundary and select top-K prefix ranges; the second variable-prefix diffusion drafter re-anchors at each selected prefix and proposes alternative continuations in one batched pass; the resulting shared-prefix candidates are jointly verified via cascade attention, yielding higher acceptance rates than both the underlying single diffusion approach and strong autoregressive speculative decoding baselines.

What carries the argument

The dual diffusion draft pair with confidence-guided prefix tree selection and cascade attention verification.

If this is right

More tokens are accepted per target-model forward pass while the added drafting cost remains controlled by prefix sharing.
Early rejection points no longer force complete discard of the remaining draft block.
The same verification budget accepts longer effective sequences than single-sequence diffusion drafts.
Naive batching of independent drafts is replaced by structured recovery that reduces redundant computation.
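
A toy accounting of the first three points, assuming greedy verification in which a candidate is accepted up to its first mismatch with the target model's own tokens. The token IDs are invented for illustration, and the bonus token that the target usually contributes after a rejection is ignored.

```python
from typing import List, Sequence


def accepted_len(draft: Sequence[int], target: Sequence[int]) -> int:
    """Number of leading draft tokens that match the target's tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def best_accepted(candidates: List[Sequence[int]], target: Sequence[int]) -> int:
    """Longest accepted run over all candidates verified in the same step."""
    return max(accepted_len(c, target) for c in candidates)


target = [11, 12, 13, 14, 15, 16]        # what the target model would emit
first_draft = [11, 12, 99, 14, 15, 16]   # mismatch at position 2
recovery = [
    [11, 12, 13, 14, 15, 98],            # re-anchored after a prefix of length 2
    [11, 12, 13, 96, 15, 16],            # a second alternative from the same prefix
]

print(accepted_len(first_draft, target))                # -> 2
print(best_accepted([first_draft] + recovery, target))  # -> 5
```

The single draft loses everything after position 2, while the recovered candidates let the same verification step accept 5 tokens.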

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The prefix-tree recovery pattern could apply to other parallel drafters that output confidence estimates.
If the confidence scores correlate with actual acceptance, the method may reduce the need for deeper tree search in speculative decoding.
The two-drafter split suggests a general trade-off between initial coverage and targeted recovery that could be tuned by varying K.

Load-bearing premise

The per-position confidence scores from the first diffusion drafter can reliably identify the most likely rejection boundary and the second variable-prefix diffusion drafter can propose effective alternative continuations.

What would settle it

Measure whether acceptance rate falls when the second drafter is replaced by random or fixed continuations from the selected prefixes, or when prefix selection ignores the confidence scores.

Original abstract

Speculative decoding accelerates autoregressive large language model inference by drafting multiple tokens and verifying them in a single target-model forward pass. Recent diffusion-based drafters generate an entire block of tokens in parallel but usually commit to a single draft sequence per verification: once the first mismatch occurs, all subsequent draft tokens are discarded, resulting in a limited acceptance rate. Naively batching more draft candidate sequences only introduces a marginal improvement, as redundant or poorly placed branches increase the cost of drafting and verification without proportionally increasing the number of accepted tokens. We propose D^2SD, a dual diffusion draft speculative decoding framework that organizes candidates into a confidence-guided prefix tree, where the first diffusion drafter generates a block along with per-position confidence scores that are used to identify the most likely rejection boundary and select the top-K prefix ranges for recovery; the second variable-prefix diffusion drafter re-anchors at each selected prefix and proposes alternative continuations in one batched pass; the resulting shared-prefix candidates are jointly verified via cascade attention. Empirically, D^2SD shows clear improvements over both the underlying diffusion approach and strong autoregressive speculative decoding baselines.
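
To see why shared prefixes matter for verification cost, the sketch below counts how many distinct token positions a set of candidates occupies when stored as a prefix tree, compared with verifying each candidate as a separate flat sequence. The helper `count_trie_nodes` and the candidate token IDs are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List, Sequence


def count_trie_nodes(candidates: List[Sequence[int]]) -> int:
    """Count distinct (prefix, token) nodes when candidates are merged into a trie."""
    root: Dict[int, dict] = {}
    nodes = 0
    for seq in candidates:
        node = root
        for tok in seq:
            if tok not in node:
                node[tok] = {}
                nodes += 1
            node = node[tok]
    return nodes


candidates = [
    [1, 2, 3, 4],  # first drafter's block
    [1, 2, 5, 6],  # alternative continuation after a prefix of length 2
    [1, 7, 8, 9],  # alternative continuation after a prefix of length 1
]

flat_tokens = sum(len(c) for c in candidates)
tree_tokens = count_trie_nodes(candidates)
print(flat_tokens, tree_tokens)  # -> 12 9
```

In this toy case the tree needs 9 token positions instead of 12, and the gap widens as more candidates share longer prefixes.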

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

D^2SD adds a dual-diffusion prefix tree to speculative decoding, but the abstract supplies no results and the confidence-to-rejection link looks shaky.

The punchline on this one is that D^2SD organizes two diffusion drafters into a confidence-guided prefix tree with cascade verification to generate better draft candidates for speculative decoding. It directly targets the single-sequence commitment problem in diffusion drafters.

The new element is the way the first drafter's per-position scores pick top-K prefixes for the second drafter to continue from, all in one batched pass, then verified jointly. This goes beyond just running multiple independent drafts. The cascade attention part sounds like a way to share computation on the common prefixes.
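
The computation-sharing idea can be illustrated with the standard identity behind shared-prefix attention: softmax attention over a prefix KV segment and a suffix KV segment can be computed separately and merged exactly through their log-sum-exp values. The abstract does not describe the paper's kernel, so this is only a sketch of that general identity; `attn_state`, `merge_states`, and the numbers are hypothetical.

```python
import math
from typing import List, Tuple

Vec = List[float]


def attn_state(q: Vec, keys: List[Vec], values: List[Vec]) -> Tuple[Vec, float]:
    """Softmax attention of one query over one KV segment.

    Returns the segment output and the log-sum-exp of its scaled scores.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]
    return out, m + math.log(z)


def merge_states(out_a: Vec, lse_a: float, out_b: Vec, lse_b: float) -> Tuple[Vec, float]:
    """Combine two segment states into the state over their concatenation."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    out = [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]
    return out, m + math.log(wa + wb)


q = [1.0, 0.5]
prefix_k, prefix_v = [[0.2, 0.1], [0.4, -0.3]], [[1.0, 0.0], [0.0, 1.0]]  # shared prefix
suffix_k, suffix_v = [[-0.1, 0.6]], [[0.5, 0.5]]                          # one branch

full, _ = attn_state(q, prefix_k + suffix_k, prefix_v + suffix_v)
a, lse_a = attn_state(q, prefix_k, prefix_v)
b, lse_b = attn_state(q, suffix_k, suffix_v)
merged, _ = merge_states(a, lse_a, b, lse_b)
print(all(abs(x - y) < 1e-12 for x, y in zip(full, merged)))  # -> True
```

Because the merge is exact, the shared-prefix segment can be handled once for a batch of branch queries while each branch only adds its own short suffix segment.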

It does a good job explaining the limitations of prior approaches: committing to one sequence wastes later tokens on mismatch, and naive batching adds overhead without smart placement of branches. The dual setup with variable prefixes is a logical next step for that.

Where it gets soft is the empirical side. The abstract asserts clear improvements but includes no numbers, no tables, no details on models, datasets, or even the size of the gains. That makes it hard to assess if the method actually delivers or if it's sensitive to particular setups. The stress-test note raises a fair point about the confidence scores. Since diffusion generates everything at once, those scores are calculated afterward rather than from sequential probabilities. Without evidence that they align with where the target model actually rejects tokens, the prefix selection could be no better than random or fixed branching. The paper should include some validation of that mapping.

This paper is for people in the LLM systems community who care about inference speed. A reader looking for new draft organization tricks might find the tree idea worth trying. It is worth sending to peer review because the core idea is well-motivated and the engineering is concrete, even though the writeup needs more data to stand on its own.

Recommendation: Yes, put it through review and ask specifically for the experimental results and any analysis on the confidence scores' predictive power.

Referee Report

2 major / 1 minor

Summary. The paper introduces D^2SD, a dual diffusion draft speculative decoding framework. The first diffusion drafter generates a token block along with per-position confidence scores used to identify the most likely rejection boundary and select top-K prefix ranges; the second variable-prefix diffusion drafter re-anchors at each selected prefix to propose alternative continuations in one batched pass; the resulting shared-prefix candidates are jointly verified via cascade attention. The abstract asserts that this yields clear empirical improvements over the underlying diffusion approach and strong autoregressive speculative decoding baselines.

Significance. If the empirical gains are validated with proper metrics and controls, the method could improve acceptance rates in diffusion-based speculative decoding by using confidence-guided branching to recover from early mismatches without the overhead of naive multi-draft batching, addressing a practical limitation in parallel draft generation for LLM inference acceleration.

major comments (2)

Abstract: the central claim that 'D^2SD shows clear improvements' supplies no metrics, experimental details, error bars, or verification of the results, which is load-bearing for an empirical engineering contribution and prevents any assessment of whether the dual-drafter mechanism supports the stated gains.
Method description (dual diffusion draft): the core mechanism assumes per-position confidence scores from the first diffusion drafter reliably identify the rejection boundary for top-K prefix selection. Diffusion models generate the full block in a single parallel forward pass, so these scores are post-hoc estimates; without a derivation, calibration, or evidence that they correlate with actual target-model verification rejections, the selected prefixes may not yield higher acceptance rates than a single diffusion draft or AR baseline.
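
One way the requested evidence could be gathered is a ranking check: treat the drafter's confidence as a score and ask how often an accepted position outranks a rejected one (the AUC). The sketch below is an assumption about how such a check might look, with invented data; in practice only positions up to and including the first rejection in each block have a meaningful accept or reject label.

```python
from typing import List


def confidence_auc(conf: List[float], accepted: List[bool]) -> float:
    """Probability that a random accepted position has higher confidence than
    a random rejected one (ties count as half). 0.5 means no signal."""
    pos = [c for c, a in zip(conf, accepted) if a]
    neg = [c for c, a in zip(conf, accepted) if not a]
    if not pos or not neg:
        raise ValueError("need at least one accepted and one rejected position")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented (confidence, accepted-by-target) pairs pooled over several blocks.
conf = [0.95, 0.90, 0.60, 0.80, 0.30, 0.70]
accepted = [True, True, True, False, False, True]
print(confidence_auc(conf, accepted))  # -> 0.75
```

A value near 0.5 on real traces would support the concern that confidence-guided selection is no better than random branching; a value well above 0.5 would support the mechanism.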

minor comments (1)

The abstract would be strengthened by briefly indicating the models, datasets, or hardware used to obtain the claimed empirical results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating planned revisions to strengthen the empirical presentation and methodological justification.

Referee: Abstract: the central claim that 'D^2SD shows clear improvements' supplies no metrics, experimental details, error bars, or verification of the results, which is load-bearing for an empirical engineering contribution and prevents any assessment of whether the dual-drafter mechanism supports the stated gains.

Authors: We agree the abstract would benefit from quantitative anchors. The full manuscript reports acceptance rates, wall-clock speedups, and comparisons against diffusion and AR baselines with standard error bars across multiple runs and models. In revision we will expand the abstract to cite the primary gains (e.g., relative acceptance-rate lift and tokens-per-step improvement) while remaining within length limits.

Revision: yes

Referee: Method description (dual diffusion draft): the core mechanism assumes per-position confidence scores from the first diffusion drafter reliably identify the rejection boundary for top-K prefix selection. Diffusion models generate the full block in a single parallel forward pass, so these scores are post-hoc estimates; without a derivation, calibration, or evidence that they correlate with actual target-model verification rejections, the selected prefixes may not yield higher acceptance rates than a single diffusion draft or AR baseline.

Authors: The scores are the per-position softmax probabilities produced by the diffusion drafter during its single parallel pass. While we do not supply a closed-form derivation linking these probabilities to target-model rejection locations, the end-to-end experiments demonstrate that the resulting prefix tree yields measurably higher acceptance rates than the single-draft diffusion baseline. We will add an expanded paragraph in the method section explaining the heuristic rationale and include an ablation that isolates the effect of the top-K prefix selection.

Revision: partial
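
The simulated response describes the scores as per-position softmax probabilities. A minimal sketch of that reading follows, assuming the confidence at each position is the probability of the greedily chosen token; the function names and the toy logits over a 3-token vocabulary are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def draft_with_confidence(block_logits: List[List[float]]) -> Tuple[List[int], List[float]]:
    """Pick the argmax token at every position and report its probability.

    ASSUMPTION: confidence = probability of the selected token.
    """
    tokens, conf = [], []
    for logits in block_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(best)
        conf.append(probs[best])
    return tokens, conf


# Toy logits for a 2-position block.
tokens, conf = draft_with_confidence([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(tokens, [round(c, 3) for c in conf])  # -> [0, 0] [0.787, 0.333]
```

The second position shows the low-confidence case: a flat distribution yields a score of one third, which would mark it as a likely rejection point.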

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes an empirical engineering proposal for dual diffusion speculative decoding that organizes candidates via confidence-guided prefix trees and cascade attention. No load-bearing derivation, equation, or uniqueness claim reduces to its own inputs by construction, self-citation, or fitted-parameter renaming. The method extends prior diffusion and attention components with new heuristics whose validity is assessed through external benchmarks rather than internal redefinition. All central mechanisms remain falsifiable outside the fitted values of the present work.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the method relies on standard components from prior diffusion and attention literature with likely untuned hyperparameters such as top-K selection.

free parameters (1)

top-K
Number of prefix ranges selected for recovery by the second drafter; value not specified but required for the framework to function.

pith-pipeline@v0.9.1-grok · 5759 in / 995 out tokens · 27370 ms · 2026-06-28T04:48:58.062400+00:00 · methodology
