FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

Biqing Qi; Jianuo Huang; Junlong Ke; Linfeng Zhang; Tianchen Zhao; Yaojie Zhang; Yongji Long; Yuhang Han

arxiv: 2605.20022 · v1 · pith:FGPPAUVVnew · submitted 2026-05-19 · 💻 cs.CL

Yaojie Zhang, Jianuo Huang, Junlong Ke, Yuhang Han, Yongji Long, Tianchen Zhao, Biqing Qi, Linfeng Zhang

Pith reviewed 2026-05-20 05:52 UTC · model grok-4.3

classification 💻 cs.CL

keywords speculative decoding · LLM inference acceleration · attention tuning · batch size adaptation · lossless decoding · bonus token calibration · parallel verification

The pith

FlexDraft enables lossless speculative decoding that adapts to any batch size by tuning attention and calibrating bonus tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speculative decoding accelerates LLM inference by having a fast drafter propose tokens for the target model to verify in parallel, but parallel versions often lose their speed advantage at large batch sizes due to uncertainty in the bonus token and accepted length. The paper claims FlexDraft fixes this mismatch while staying exactly lossless through three designs that work together. Attention Tuning adjusts only the attention projectors in the final layers on mask tokens to generate high-quality drafts without altering the main autoregressive computation. Bonus-guided Calibration applies a small MLP to correct draft logits once the bonus token is known, and Flex Decoding switches between parallel and sequential modes while varying the verification length based on confidence. If correct, this delivers consistent throughput gains regardless of whether workloads use small or large batches.
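As a rough illustration of the draft-and-verify step described above, the following minimal sketch shows greedy verification of one draft block. The names `verify_greedy`, `target_next`, and `toy_target` are hypothetical. The token-by-token loop stands in for what would be a single parallel forward pass of the target model, and calling the target's token at the stopping position the bonus token is an assumption about the paper's terminology.

```python
from typing import Callable, List, Sequence, Tuple


def verify_greedy(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> Tuple[List[int], int]:
    """Return (accepted draft tokens, bonus token) for one verification step."""
    context = list(prefix)
    accepted: List[int] = []
    for token in draft:
        expected = target_next(context)  # target's greedy choice at this position
        if token != expected:
            # First mismatch: drop the rest of the draft and emit the target's token.
            return accepted, expected
        accepted.append(token)
        context.append(token)
    # Every draft token matched, so the target contributes one extra token.
    return accepted, target_next(context)


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for the target model: always predicts last token + 1."""
    return (context[-1] + 1) % 10


accepted, bonus = verify_greedy([1, 2], [3, 4, 9, 6], toy_target)
print(accepted, bonus)  # [3, 4] 5
```

In this sketch the emitted sequence (prefix, accepted tokens, bonus token) is the same one the toy target would produce on its own under greedy decoding, which is the sense in which verification keeps the output unchanged.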

Core claim

FlexDraft is a lossless speculative decoding framework that flexibly adapts to varying batch sizes through three key designs: Attention Tuning enables block diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens while keeping the autoregressive path frozen to preserve the target distribution and produce high quality drafts with minimal trainable parameters; Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits, mitigating draft verification mismatch caused by bonus token uncertainty; Flex Decoding dynamically switches between parallel draft and verify at small batch sizes and sequential draft then verify at large batch sizes, and adjusts verification length based on draft confidence to eliminate redundant computation.

What carries the argument

Attention Tuning on final-layer projectors using mask tokens, paired with Bonus-guided Calibration via a lightweight MLP on the resolved bonus token and dynamic Flex Decoding mode switching.
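One way to picture Bonus-guided Calibration is as a small correction added to the draft logits once the bonus token is known. The sketch below is only an assumption about the shape of such a module: the residual form, the ReLU hidden layer, the choice to feed the draft logits together with the bonus embedding, and the names `calibrate`, `w1`, and `w2` are all hypothetical and not taken from the paper.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matvec(w: Matrix, x: Sequence[float]) -> List[float]:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def calibrate(
    draft_logits: Sequence[float],
    bonus_embedding: Sequence[float],
    w1: Matrix,
    w2: Matrix,
) -> List[float]:
    """Return draft logits plus an MLP correction conditioned on the bonus token."""
    features = list(draft_logits) + list(bonus_embedding)  # concatenate inputs
    hidden = [max(0.0, h) for h in matvec(w1, features)]   # ReLU hidden layer
    correction = matvec(w2, hidden)                        # one value per vocab entry
    return [l + c for l, c in zip(draft_logits, correction)]


# Toy sizes (assumed): vocabulary of 3, bonus embedding of 2, hidden width of 2.
draft_logits = [1.0, 0.0, -1.0]
bonus_embedding = [1.0, 0.0]
w1 = [[0.0, 0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0, 1.0]]
w2 = [[0.5, 0.0], [-0.5, 0.0], [0.0, 0.0]]

print(calibrate(draft_logits, bonus_embedding, w1, w2))  # [1.5, -0.5, -1.0]
```

With `w2` set to all zeros the correction vanishes and the draft logits pass through unchanged, which makes the residual form a convenient starting point for a lightweight add-on.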

If this is right

The target model distribution remains exactly unchanged, guaranteeing lossless generation.
Only a small set of attention parameters need training, keeping overhead low.
Draft verification mismatch from bonus uncertainty is reduced through explicit calibration.
Redundant computation is avoided by switching modes and lengths based on batch size and confidence.
Throughput gains from parallel verification are preserved rather than collapsing at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The tuning strategy might allow reuse of the same target model for both drafting and verification in resource-constrained settings.
Similar calibration could address uncertainty in other multi-token prediction schemes beyond speculative decoding.
The dynamic switching logic might generalize to mixed workloads that combine generation with retrieval or tool use.

Load-bearing premise

Tuning only the attention projectors of the final few layers on mask tokens while keeping the autoregressive path frozen preserves the target distribution and produces high quality drafts.
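The premise can be made concrete with a toy routing example: if mask tokens go through a tuned projector while clean tokens keep the frozen one, then changing the tuned weights cannot change any clean-token projection. The sketch below shows only that routing step, with hypothetical names (`project_tokens`, `w_frozen`, `w_tuned`). It assumes clean tokens never attend to mask tokens and says nothing about how the paper actually implements the split.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matvec(w: Matrix, x: Sequence[float]) -> List[float]:
    """Multiply matrix w by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def project_tokens(
    hidden_states: Sequence[Sequence[float]],
    is_mask: Sequence[bool],
    w_frozen: Matrix,
    w_tuned: Matrix,
) -> List[List[float]]:
    """Project each token with the tuned weights if it is a mask token, else frozen."""
    return [
        matvec(w_tuned if mask else w_frozen, h)
        for h, mask in zip(hidden_states, is_mask)
    ]


w_frozen = [[1.0, 0.0], [0.0, 1.0]]
states = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]]
flags = [False, False, True]  # two clean tokens followed by one mask token

before = project_tokens(states, flags, w_frozen, w_tuned=[[1.0, 0.0], [0.0, 1.0]])
after = project_tokens(states, flags, w_frozen, w_tuned=[[2.0, 0.0], [0.0, -1.0]])

print(before[:2] == after[:2])  # True: clean-token projections are untouched
print(before[2] != after[2])    # True: only the mask-token projection moved
```

Whether the full model behaves this way depends on the attention mask and on verification using only the frozen path, which is the point the referee report presses on.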

What would settle it

An experiment that measures acceptance rates and end-to-end throughput at large batch sizes and finds them no better than standard sequential speculative decoding would disprove the adaptation claim; any measurable quality drop would disprove the lossless claim.

Figures

Figures reproduced from arXiv: 2605.20022 by Biqing Qi, Jianuo Huang, Junlong Ke, Linfeng Zhang, Tianchen Zhao, Yaojie Zhang, Yongji Long, Yuhang Han.

**Figure 1.** Limitations of the parallel speculative decoding paradigm. (a) Given the prefix, the target model predicts the next token and prefers “Er”, the first token of “Ernest Hemingway”, which constrains the subsequent generation toward Hemingway. Without access to the bonus token, the draft model tends to favor alternative continuations. (b) Accept length uncertainty forces parallel speculative decoding to consid…

**Figure 2.** Attention masks in FlexDraft. (a) Training. The target performs causal forward to build the clean prefix KV cache. Mask tokens attend bidirectionally within each block and to the prefix, isolated from other blocks. (b) Decoding. Our method supports both parallel and sequential speculative decoding. In parallel mode, the latest draft is verified while candidates for all possible accepted lengths are prepare…

**Figure 3.** Pipeline of FlexDraft. Shallow layers process the clean prefix identically to a standard autoregressive forward pass. In the deep layers, mask tokens are appended to the prefix and routed through trainable attention projectors, enabling parallel draft prediction. Bonus-guided Calibration injects the verified bonus token embedding into a lightweight MLP to adjust draft logits, which improves draft quality. …

**Figure 5.** Execution time of a single draft and verify step. [Plot not reproduced; recoverable labels: Parallel SD, Sequential SD, Selected; Speedup (Avg); Number of Draft Layer.]

**Figure 7.** Ablation analysis of speedup. [Plot not reproduced; recoverable labels: GSM8k (AVG Length: 9.98), HumanEval (AVG Length: 7.57), MT-Bench (AVG Length: 5.32); Frequency.]
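The training mask described in the Figure 2 caption can be sketched as a boolean matrix. The layout below is a simplification and an assumption: all mask blocks share one clean prefix, clean tokens come first, and the function name `build_training_mask` and its parameters are hypothetical.

```python
from typing import List


def build_training_mask(prefix_len: int, block_size: int, num_blocks: int) -> List[List[bool]]:
    """Return allowed[q][k]: True if query position q may attend to key position k."""
    total = prefix_len + block_size * num_blocks
    allowed = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < prefix_len:
                # Prefix keys: clean queries are causal; mask queries see the whole prefix.
                allowed[q][k] = k <= q
            elif q >= prefix_len:
                # Mask keys are visible only to mask queries in the same block.
                q_block = (q - prefix_len) // block_size
                k_block = (k - prefix_len) // block_size
                allowed[q][k] = q_block == k_block
    return allowed


mask = build_training_mask(prefix_len=2, block_size=2, num_blocks=2)
for row in mask:
    print("".join("1" if a else "0" for a in row))
# 100000
# 110000
# 111100
# 111100
# 110011
# 110011
```

The first two rows are the ordinary causal pattern for the clean prefix, and each later pair of rows shows a mask block attending to the prefix and to itself but not to the other block.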

Original abstract

Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting between drafting and verification, and repeated exchange of intermediate states further increases memory access overhead. Parallel speculative decoding addresses this limitation by performing drafting and verification within a single target forward pass, allowing future drafts to be prepared while current candidates are being verified. Although effective at small batch sizes, existing parallel speculative decoding methods either require costly continual pretraining with quality degradation or suffer from low acceptance rates. More importantly, this paradigm inherently suffers from uncertainty in both the bonus token and the accepted length, leading to draft verification mismatch and causing throughput gains to collapse at large batch sizes. To address these limitations, we introduce FlexDraft, a lossless speculative decoding framework that flexibly adapts to varying batch sizes through three key designs. (1) Attention Tuning enables block diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens, while keeping the autoregressive path frozen to preserve the target distribution and produce high quality drafts with minimal trainable parameters. (2) Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits, mitigating draft verification mismatch caused by bonus token uncertainty. (3) Flex Decoding dynamically switches between parallel draft and verify at small batch sizes and sequential draft then verify at large batch sizes, and adjusts verification length based on draft confidence to eliminate redundant computation.
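The switching behaviour in design (3) could look roughly like the sketch below. The abstract does not give the decision rule, so the batch-size threshold, the confidence cutoff, the rule of truncating at the first low-confidence draft token, and the names `choose_mode` and `verification_length` are all illustrative assumptions.

```python
from typing import Sequence


def choose_mode(batch_size: int, parallel_max_batch: int = 8) -> str:
    """Pick parallel draft-and-verify for small batches, sequential otherwise.

    parallel_max_batch is a hypothetical threshold, not a value from the paper.
    """
    return "parallel" if batch_size <= parallel_max_batch else "sequential"


def verification_length(confidences: Sequence[float], min_confidence: float = 0.5) -> int:
    """Count leading draft tokens whose confidence stays at or above the cutoff."""
    length = 0
    for confidence in confidences:
        if confidence < min_confidence:
            break  # later tokens are unlikely to be accepted, so skip verifying them
        length += 1
    return length


print(choose_mode(1))                                # parallel
print(choose_mode(64))                               # sequential
print(verification_length([0.9, 0.8, 0.3, 0.95]))    # 2
```

Stopping at the first low-confidence token reflects the fact that acceptance is prefix-based: once one draft token is rejected, the tokens after it are discarded regardless of their own confidence.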

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FlexDraft adapts speculative decoding to varying batch sizes via targeted attention tuning and bonus calibration, but the lossless claim rests on an assumption about distribution preservation that needs direct checks.

FlexDraft is a speculative decoding method that adapts to different batch sizes using attention tuning, bonus calibration, and mode switching. The core idea is to keep the target model intact while adding lightweight adjustments for better drafting in parallel setups.

There are three new parts. The first is attention tuning limited to the attention projectors of the final layers on mask tokens, which is meant to generate drafts without retraining the whole model. The second is an MLP that uses the resolved bonus token to adjust draft logits. The third is Flex Decoding, which switches between parallel and sequential modes depending on batch size to avoid the throughput collapse at large batches. These choices address the uncertainty in bonus tokens and accepted lengths that hurts performance at large batches. The designs seem practical and respond to known limitations in the field.

The soft spot is the preservation of the target distribution. The claim relies on the tuning not affecting non-mask paths, but since attention is involved in all computations, small changes might shift probabilities. The paper would need to demonstrate that the final output matches the original model exactly, perhaps through direct comparison or statistical tests.

Overall, this is for people focused on LLM serving and inference optimization. Readers dealing with batch size variability in deployment would get the most from the adaptive strategy. It deserves a serious referee because it targets a genuine deployment issue with specific, implementable ideas. I would recommend engaging with it in peer review, especially to probe the distribution invariance.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FlexDraft, a lossless speculative decoding framework for LLMs that adapts to varying batch sizes. It proposes three designs: (1) Attention Tuning, which tunes only the attention projectors of the final few layers on mask tokens while freezing the autoregressive path to preserve the target distribution and generate high-quality drafts with few parameters; (2) Bonus-guided Calibration, which uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits and reduce verification mismatch; and (3) Flex Decoding, which switches between parallel draft-and-verify at small batches and sequential draft-then-verify at large batches while adjusting verification length by draft confidence.

Significance. If the lossless property and throughput improvements hold across batch sizes, the work would meaningfully advance memory-bound LLM inference by mitigating limitations of prior parallel speculative decoding approaches, such as low acceptance rates and collapse at scale. The minimal-parameter Attention Tuning and dynamic mode switching are practical strengths that could enable broader adoption in production settings.

major comments (2)

§3.1 (Attention Tuning): The lossless guarantee rests on the claim that tuning attention projectors only on mask tokens while freezing the autoregressive path leaves the target distribution unchanged for standard inputs. Because attention projectors participate in every subsequent layer computation, small modifications can propagate to alter hidden-state trajectories and logits unless an explicit isolation mechanism (e.g., a distribution-matching regularizer or architectural mask) is enforced. No such invariance argument or verification is supplied, making the preservation assumption load-bearing for the central lossless claim.
§5 (Experiments): The reported throughput and acceptance-rate gains at large batch sizes must be accompanied by direct comparisons against both sequential speculative decoding and prior parallel methods, with explicit measurement of draft verification mismatch before and after Bonus-guided Calibration. Without these controls, the flexibility claim across batch sizes remains under-supported.

minor comments (2)

The abstract would be strengthened by a single sentence summarizing the empirical acceptance rates and throughput improvements observed.
[§3.2] Notation for the bonus token and calibrated logits should be introduced consistently in §3.2 to avoid ambiguity when describing the MLP conditioning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the planned revisions.

Referee: §3.1 (Attention Tuning): The lossless guarantee rests on the claim that tuning attention projectors only on mask tokens while freezing the autoregressive path leaves the target distribution unchanged for standard inputs. Because attention projectors participate in every subsequent layer computation, small modifications can propagate to alter hidden-state trajectories and logits unless an explicit isolation mechanism (e.g., a distribution-matching regularizer or architectural mask) is enforced. No such invariance argument or verification is supplied, making the preservation assumption load-bearing for the central lossless claim.

Authors: We acknowledge the referee's point on potential propagation through subsequent layers. The design freezes the autoregressive path for standard tokens and applies tuning exclusively to mask tokens that are absent from inference inputs. To strengthen the lossless claim, we will add to §3.1 both a formal argument establishing that mask-token modifications do not activate during standard generation and empirical verification via KL-divergence measurements between pre- and post-tuning output distributions on held-out standard sequences. These additions will be incorporated in the revision.

revision: yes

Referee: §5 (Experiments): The reported throughput and acceptance-rate gains at large batch sizes must be accompanied by direct comparisons against both sequential speculative decoding and prior parallel methods, with explicit measurement of draft verification mismatch before and after Bonus-guided Calibration. Without these controls, the flexibility claim across batch sizes remains under-supported.

Authors: We agree that the requested controls would better substantiate the flexibility claim. We will expand §5 to include direct throughput and acceptance-rate comparisons against sequential speculative decoding at large batches, comparisons to additional prior parallel methods, and explicit quantification of draft verification mismatch (e.g., accepted-length discrepancy and logit calibration error) measured before versus after Bonus-guided Calibration. New tables and figures will be added to demonstrate the calibration's impact and sustained gains across batch sizes.

revision: yes
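The KL-divergence check mentioned in the response to the §3.1 comment could be sketched as follows. This is a generic computation over two logit vectors for the same position, not the authors' evaluation code, and the function names `softmax` and `kl_divergence` are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p_logits: Sequence[float], q_logits: Sequence[float]) -> float:
    """Return KL(P || Q) in nats, where P and Q are the softmax of the given logits."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Identical logits before and after tuning give zero divergence.
print(kl_divergence([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # 0.0

# P = [0.5, 0.5] against Q = [0.25, 0.75] gives 0.5 * ln(4/3).
print(kl_divergence([0.0, 0.0], [0.0, math.log(3.0)]))
```

A divergence of zero on held-out sequences is consistent with an unchanged distribution on those inputs; an exact-match comparison of logits is a stricter test when the frozen path is claimed to be bit-identical.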

Circularity Check

0 steps flagged

No significant circularity; designs are independent engineering choices

The paper introduces FlexDraft through three explicit design components—Attention Tuning (tuning final-layer attention projectors on mask tokens while freezing the autoregressive path), Bonus-guided Calibration (MLP conditioned on resolved bonus token), and Flex Decoding (dynamic switching between parallel and sequential modes). These are presented as practical solutions to batch-size limitations and verification mismatch, with the lossless property asserted as a direct consequence of keeping the autoregressive path frozen. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described claims. The derivation chain consists of independent architectural decisions rather than reductions to inputs by construction, making the framework self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based solely on the abstract; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5823 in / 1014 out tokens · 44311 ms · 2026-05-20T05:52:04.255978+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: Attention Tuning enables block diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens, while keeping the autoregressive path frozen to preserve the target distribution

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page 2025

[9] [9]

Deepseek-v3 technical report, 2024

DeepSeek-AI. Deepseek-v3 technical report, 2024

work page 2024

[10] [10]

DReSD: Dense retrieval for speculative decoding

Milan Gritta, Huiyin Xue, and Gerasimos Lampouras. DReSD: Dense retrieval for speculative decoding. InFindings of the Association for Computational Linguistics: ACL 2025, pages 19822–19832, Vienna, Austria, 2025. Association for Computational Linguistics

work page 2025

[11] [11]

REST: Retrieval-based spec- ulative decoding

Zhenyu He, Zexuan Zhong, Tianle Cai, Jason Lee, and Di He. REST: Retrieval-based spec- ulative decoding. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1582–1595, Mexico City, Mexico, 2024. Association for Computational Linguistics

work page 2024

[12] [12]

Mask tokens as prophet: Fine-grained cache eviction for efficient dllm inference.arXiv preprint arXiv:2510.09309, 2025

Jianuo Huang, Yaojie Zhang, Yicun Yang, Benhao Huang, Biqing Qi, Dongrui Liu, and Linfeng Zhang. Mask tokens as prophet: Fine-grained cache eviction for efficient dllm inference.arXiv preprint arXiv:2510.09309, 2025

work page arXiv 2025

[13] [13]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 19274–19286. PMLR, 2023. 10

work page 2023

[14] [14]

Diffuspec: Unlocking diffusion language models for speculative decoding.arXiv preprint arXiv:2510.02358, 2025

Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, and Jun Wang. Diffuspec: Unlocking diffusion language models for speculative decoding.arXiv preprint arXiv:2510.02358, 2025

work page arXiv 2025

[15] [15]

EAGLE-2: Faster inference of language models with dynamic draft trees

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-2: Faster inference of language models with dynamic draft trees. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7421–7432. Association for Computational Linguistics, 2024

work page 2024

[16] [16]

EAGLE: Speculative sampling requires rethinking feature uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE: Speculative sampling requires rethinking feature uncertainty. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 28935–28948. PMLR, 2024

work page 2024

[17] [17]

EAGLE-3: Scaling up inference acceleration of large language models via training-time test

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-3: Scaling up inference acceleration of large language models via training-time test. InAdvances in Neural Information Processing Systems, 2025

work page 2025

[18] [18]

Amphista: Bi-directional multi-head decoding for acceler- ating LLM inference

Zeping Li, Xinlong Yang, Ziheng Gao, Ji Liu, Guanchen Li, Zhuang Liu, Dong Li, Jinzhang Peng, Lu Tian, and Emad Barsoum. Amphista: Bi-directional multi-head decoding for acceler- ating LLM inference. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Vo...

work page 2025

[19] [19]

Bita: Bi-directional tuning for lossless acceleration in large language models.Expert Systems with Applications, 279:127305, 2025

Feng Lin, Hanling Yi, Yifan Yang, Hongbin Li, Xiaotian Yu, Guangming Lu, and Rong Xiao. Bita: Bi-directional tuning for lossless acceleration in large language models.Expert Systems with Applications, 279:127305, 2025

work page 2025

[20] [20]

Dart: Diffusion-inspired speculative decoding for fast llm inference.arXiv preprint arXiv:2601.19278, 2026

Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, and Chen Tian. Dart: Diffusion-inspired speculative decoding for fast llm inference.arXiv preprint arXiv:2601.19278, 2026

work page arXiv 2026

[21] [21]

Tidar: Think in diffusion, talk in autoregression.arXiv preprint arXiv:2511.08923, 2025

Jingyu Liu, Xin Dong, Zhifan Ye, Rishabh Mehta, Yonggan Fu, Vartika Singh, Jan Kautz, Ce Zhang, and Pavlo Molchanov. Tidar: Think in diffusion, talk in autoregression.arXiv preprint arXiv:2511.08923, 2025

work page arXiv 2025

[22] [22]

Pearl: Parallel speculative decoding with adaptive draft length

Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, and Xiao Sun. Pearl: Parallel speculative decoding with adaptive draft length. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[23] [23]

Liu, J., Dong, X., Ye, Z., Mehta, R., Fu, Y ., Singh, V ., Kautz, J., Zhang, C., and Molchanov, P

Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, and Linfeng Zhang. dllm-cache: Accelerating diffusion large language models with adaptive caching.arXiv preprint arXiv:2506.06295, 2025

work page arXiv 2025

[24] [24]

Specinfer: Accelerating large language model serving with tree-based speculative inference and verification

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. InProceedings of the 29th ACM I...

work page 2024

[25] [25]

Large language diffusion models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. InAdvances in Neural Information Processing Systems, 2025

work page 2025

[26] [26]

RASD: Retrieval-augmented speculative decoding

Guofeng Quan, Wenfeng Feng, Chuzhan Hao, Guochao Jiang, Yuewei Zhang, and Hao Henry Wang. RASD: Retrieval-augmented speculative decoding. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Com- putational Linguistics: ACL 2025, pages 6167–6177, Vienna, Austria, July 2025. Association for...

work page 2025

[27] [27]

Your llm knows the future: Uncovering its multi-token prediction potential.arXiv preprint arXiv:2507.11851, 2025

Mohammad Samragh, Arnav Kundu, David Harrison, Kumari Nishu, Devang Naik, Minsik Cho, and Mehrdad Farajtabar. Your llm knows the future: Uncovering its multi-token prediction potential.arXiv preprint arXiv:2507.11851, 2025

work page arXiv 2025

[28] [28]

Specbranch: Speculative decoding via hybrid drafting and rollback-aware branch parallelism

Yuhao Shen, Junyi Shen, Quan Kong, Tianyu Liu, Yao Lu, and Cong Wang. Specbranch: Speculative decoding via hybrid drafting and rollback-aware branch parallelism. InInternational Conference on Learning Representations, 2026

work page 2026

[29] [29]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

Accelerating diffusion large language models with slowfast sampling: The three golden principles.arXiv preprint arXiv:2506.10848, 2025

Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Puyu Zeng, Yuxuan Wang, Biqing Qi, Dongrui Liu, and Linfeng Zhang. Accelerating diffusion large language models with slowfast sampling: The three golden principles.arXiv preprint arXiv:2506.10848, 2025

work page arXiv 2025

[32] [32]

Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025

Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm. arXiv preprint arXiv:2509.26328, 2025

work page arXiv 2025

[33] [33]

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding.arXiv preprint arXiv:2505.22618, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Dreamon: Diffusion language models for code infilling beyond fixed-size canvas

Zirui Wu, Lin Zheng, Zhihui Xie, Jiacheng Ye, Jiahui Gao, Shansan Gong, Yansong Feng, Zhenguo Li, Wei Bi, Guorui Zhou, and Lingpeng Kong. Dreamon: Diffusion language models for code infilling beyond fixed-size canvas. InInternational Conference on Learning Representations, 2026

work page 2026

[35] [35]

Unlocking efficiency in large language model inference: A comprehensive sur- vey of speculative decoding

Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. Unlocking efficiency in large language model inference: A comprehensive sur- vey of speculative decoding. InFindings of the Association for Computational Linguistics: ACL 2024, pages 7655–7671, Bangkok, Thailand, 2024. Association for Computational L...

work page 2024

[36] [36]

KOALA: Enhancing speculative decoding for LLM via multi-layer draft heads with adversarial learning.arXiv preprint arXiv:2408.08146, 2024

Kaiqi Zhang, Jing Zhao, and Rui Chen. KOALA: Enhancing speculative decoding for LLM via multi-layer draft heads with adversarial learning.arXiv preprint arXiv:2408.08146, 2024

work page arXiv 2024

[37] [37]

Distillspec: Improving speculative decoding via knowledge distillation

Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Ros- tamizadeh, Sanjiv Kumar, Jean-François Kagy, and Rishabh Agarwal. Distillspec: Improving speculative decoding via knowledge distillation. InThe Twelfth International Conference on Learning Representations, 2024. 12 A Appendix A.1 Robustness to sampling temperature. Table 4: D...

work page 2024