FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration
Pith reviewed 2026-05-20 05:52 UTC · model grok-4.3
The pith
FlexDraft enables lossless speculative decoding that adapts to any batch size by tuning attention and calibrating bonus tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlexDraft is a lossless speculative decoding framework that flexibly adapts to varying batch sizes through three key designs: Attention Tuning enables block diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens while keeping the autoregressive path frozen to preserve the target distribution and produce high-quality drafts with minimal trainable parameters; Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits, mitigating draft verification mismatch caused by bonus token uncertainty; Flex Decoding dynamically switches between parallel draft and verify at small batch sizes and sequential at large batch sizes.
What carries the argument
Attention Tuning on final-layer projectors using mask tokens, paired with Bonus-guided Calibration via a lightweight MLP on the resolved bonus token and dynamic Flex Decoding mode switching.
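To make the calibration step concrete, here is a minimal sketch, assuming the MLP produces an additive correction to the draft logits from an embedding of the resolved bonus token. The source states only that a lightweight MLP conditioned on the bonus token calibrates the draft logits; the additive form, the single ReLU hidden layer, and the names `calibrate`, `embed`, `w1`, and `w2` are illustrative assumptions, not the paper's implementation.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def calibrate(
    draft_logits: Vector,
    bonus_token: int,
    embed: Matrix,  # hypothetical embedding table, one row per token id
    w1: Matrix,     # hypothetical hidden-layer weights (hidden x embed_dim)
    w2: Matrix,     # hypothetical output weights (vocab x hidden)
) -> Vector:
    """Add an MLP correction, conditioned on the bonus token, to draft logits.

    ASSUMPTION: the correction is additive and the MLP has one ReLU hidden
    layer; the source does not specify either detail.
    """
    hidden = [max(0.0, v) for v in matvec(w1, embed[bonus_token])]
    delta = matvec(w2, hidden)
    return [logit + d for logit, d in zip(draft_logits, delta)]
```

Under this assumed form, setting `w2` to zeros makes the correction vanish, so the uncalibrated draft logits pass through unchanged.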
If this is right
- The target model distribution remains exactly unchanged, guaranteeing lossless generation.
- Only a small set of attention parameters needs training, keeping training overhead low.
- Draft verification mismatch from bonus uncertainty is reduced through explicit calibration.
- Redundant computation is avoided by switching modes and lengths based on batch size and confidence.
- Throughput gains from parallel verification are preserved rather than collapsing at scale.
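The mode and length switching described in the list above can be sketched as a small planning function. This is a hedged illustration only: the source says Flex Decoding switches by batch size and adjusts verification length by draft confidence, but it does not give the rule. The thresholds `batch_threshold` and `conf_threshold`, the confident-prefix truncation, and the names `DecodePlan` and `plan_step` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DecodePlan:
    mode: str        # "parallel" (draft and verify together) or "sequential"
    verify_len: int  # number of draft tokens handed to verification


def plan_step(
    batch_size: int,
    draft_confidences: Sequence[float],
    batch_threshold: int = 8,     # hypothetical switch point
    conf_threshold: float = 0.5,  # hypothetical confidence cutoff
) -> DecodePlan:
    """Pick a decoding mode and a verification length for one step.

    ASSUMPTION: small batches use parallel mode, large batches use sequential
    mode, and verification covers only the leading run of confident drafts.
    """
    mode = "parallel" if batch_size <= batch_threshold else "sequential"
    verify_len = 0
    for confidence in draft_confidences:
        if confidence < conf_threshold:
            break  # stop at the first low-confidence draft token
        verify_len += 1
    return DecodePlan(mode, verify_len)
```

For example, `plan_step(2, [0.9, 0.8, 0.3, 0.9])` selects parallel mode and verifies only the first two draft tokens, since the third falls below the assumed cutoff.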
Where Pith is reading between the lines
- The tuning strategy might allow reuse of the same target model for both drafting and verification in resource-constrained settings.
- Similar calibration could address uncertainty in other multi-token prediction schemes beyond speculative decoding.
- The dynamic switching logic might generalize to mixed workloads that combine generation with retrieval or tool use.
Load-bearing premise
Tuning only the attention projectors of the final few layers on mask tokens while keeping the autoregressive path frozen preserves the target distribution and produces high quality drafts.
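One way to see when this premise can hold is a toy single-head causal attention layer in which each position selects its projectors by a mask flag. The sketch below is an assumption-laden illustration, not FlexDraft's architecture: it assumes tuned projectors are applied only at mask positions and that mask tokens are placed after the standard tokens, so causal masking prevents standard positions from attending to them. All names (`causal_attention`, `frozen`, `tuned`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Projectors = Tuple[Matrix, Matrix, Matrix]  # (W_q, W_k, W_v)


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def causal_attention(
    xs: Sequence[Vector],
    is_mask: Sequence[bool],
    frozen: Projectors,
    tuned: Projectors,
) -> List[Vector]:
    """Single-head causal attention with per-position projector selection.

    ASSUMPTION: mask positions use the tuned projectors and all other
    positions use the frozen ones.
    """
    qs, ks, vs = [], [], []
    for x, masked in zip(xs, is_mask):
        wq, wk, wv = tuned if masked else frozen
        qs.append(matvec(wq, x))
        ks.append(matvec(wk, x))
        vs.append(matvec(wv, x))
    outputs = []
    for i, q in enumerate(qs):
        # Position i attends only to positions 0..i (causal mask).
        scores = [sum(a * b for a, b in zip(q, ks[j])) for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(weights[j] / total * vs[j][d] for j in range(i + 1))
            for d in range(len(vs[0]))
        ])
    return outputs
```

Under these assumptions, the outputs at standard positions are identical with or without the appended mask tokens, while the mask position's output changes with the tuned projectors. Whether the actual method enforces this kind of isolation is exactly what the premise depends on.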
What would settle it
The adaptation claim would be disproved by an experiment at large batch sizes that finds acceptance rates and end-to-end throughput no better than standard sequential speculative decoding, or that shows any drop in output quality.
Original abstract
Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting between drafting and verification, and repeated exchange of intermediate states further increases memory access overhead. Parallel speculative decoding addresses this limitation by performing drafting and verification within a single target forward pass, allowing future drafts to be prepared while current candidates are being verified. Although effective at small batch sizes, existing parallel speculative decoding methods either require costly continual pretraining with quality degradation or suffer from low acceptance rates. More importantly, this paradigm inherently suffers from uncertainty in both the bonus token and the accepted length, leading to draft verification mismatch and causing throughput gains to collapse at large batch sizes. To address these limitations, we introduce FlexDraft, a lossless speculative decoding framework that flexibly adapts to varying batch sizes through three key designs. (1) Attention Tuning enables block diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens, while keeping the autoregressive path frozen to preserve the target distribution and produce high quality drafts with minimal trainable parameters. (2) Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits, mitigating draft verification mismatch caused by bonus token uncertainty. (3) Flex Decoding dynamically switches between parallel draft and verify at small batch sizes and sequential draft then verify at large batch sizes, and adjusts verification length based on draft confidence to eliminate redundant computation.
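The draft-then-verify step that keeps speculative decoding lossless can be sketched as follows. This is the standard speculative sampling acceptance rule, shown for orientation; it is not FlexDraft's specific verifier, and the function names and list-based probability tables are illustrative assumptions.

```python
import random
from typing import List, Sequence


def sample(weights: Sequence[float], rng: random.Random) -> int:
    """Draw an index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def verify_draft(
    draft_tokens: Sequence[int],
    draft_probs: Sequence[Sequence[float]],   # q_i(.) per drafted position
    target_probs: Sequence[Sequence[float]],  # p_i(.) per position, plus one extra
    rng: random.Random,
) -> List[int]:
    """Return the accepted prefix followed by one corrected or bonus token."""
    out: List[int] = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        accept_prob = min(1.0, p[tok] / q[tok]) if q[tok] > 0 else 0.0
        if rng.random() < accept_prob:
            out.append(tok)
            continue
        # On rejection, resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        if sum(residual) == 0.0:
            residual = list(p)  # degenerate case: fall back to the target
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: draw the bonus token from the target.
    out.append(sample(target_probs[len(draft_tokens)], rng))
    return out
```

The final token in the all-accepted case is the bonus token. In parallel schemes it is not yet known when the next drafts are prepared, which is the uncertainty the abstract refers to.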
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FlexDraft, a lossless speculative decoding framework for LLMs that adapts to varying batch sizes. It proposes three designs: (1) Attention Tuning, which tunes only the attention projectors of the final few layers on mask tokens while freezing the autoregressive path to preserve the target distribution and generate high-quality drafts with few parameters; (2) Bonus-guided Calibration, which uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits and reduce verification mismatch; and (3) Flex Decoding, which switches between parallel draft-and-verify at small batches and sequential draft-then-verify at large batches while adjusting verification length by draft confidence.
Significance. If the lossless property and throughput improvements hold across batch sizes, the work would meaningfully advance memory-bound LLM inference by mitigating limitations of prior parallel speculative decoding approaches, such as low acceptance rates and collapse at scale. The minimal-parameter Attention Tuning and dynamic mode switching are practical strengths that could enable broader adoption in production settings.
major comments (2)
- §3.1 (Attention Tuning): The lossless guarantee rests on the claim that tuning attention projectors only on mask tokens while freezing the autoregressive path leaves the target distribution unchanged for standard inputs. Because attention projectors participate in every subsequent layer computation, small modifications can propagate to alter hidden-state trajectories and logits unless an explicit isolation mechanism (e.g., a distribution-matching regularizer or architectural mask) is enforced. No such invariance argument or verification is supplied, making the preservation assumption load-bearing for the central lossless claim.
- §5 (Experiments): The reported throughput and acceptance-rate gains at large batch sizes must be accompanied by direct comparisons against both sequential speculative decoding and prior parallel methods, with explicit measurement of draft verification mismatch before and after Bonus-guided Calibration. Without these controls, the flexibility claim across batch sizes remains under-supported.
minor comments (2)
- The abstract would be strengthened by a single sentence summarizing the empirical acceptance rates and throughput improvements observed.
- [§3.2] Notation for the bonus token and calibrated logits should be introduced consistently in §3.2 to avoid ambiguity when describing the MLP conditioning.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the planned revisions.
- Referee: §3.1 (Attention Tuning): The lossless guarantee rests on the claim that tuning attention projectors only on mask tokens while freezing the autoregressive path leaves the target distribution unchanged for standard inputs. Because attention projectors participate in every subsequent layer computation, small modifications can propagate to alter hidden-state trajectories and logits unless an explicit isolation mechanism (e.g., a distribution-matching regularizer or architectural mask) is enforced. No such invariance argument or verification is supplied, making the preservation assumption load-bearing for the central lossless claim.
  Authors: We acknowledge the referee's point on potential propagation through subsequent layers. The design freezes the autoregressive path for standard tokens and applies tuning exclusively to mask tokens that are absent from inference inputs. To strengthen the lossless claim, we will add to §3.1 both a formal argument establishing that mask-token modifications do not activate during standard generation and empirical verification via KL-divergence measurements between pre- and post-tuning output distributions on held-out standard sequences. These additions will be incorporated in the revision.
  Revision: yes
- Referee: §5 (Experiments): The reported throughput and acceptance-rate gains at large batch sizes must be accompanied by direct comparisons against both sequential speculative decoding and prior parallel methods, with explicit measurement of draft verification mismatch before and after Bonus-guided Calibration. Without these controls, the flexibility claim across batch sizes remains under-supported.
  Authors: We agree that the requested controls would better substantiate the flexibility claim. We will expand §5 to include direct throughput and acceptance-rate comparisons against sequential speculative decoding at large batches, comparisons to additional prior parallel methods, and explicit quantification of draft verification mismatch (e.g., accepted-length discrepancy and logit calibration error) measured before versus after Bonus-guided Calibration. New tables and figures will be added to demonstrate the calibration's impact and sustained gains across batch sizes.
  Revision: yes
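The KL-divergence check proposed in the response to the §3.1 comment could look roughly like the sketch below. It compares output distributions before and after tuning on the same standard inputs and reports the worst case. The helper names and the epsilon smoothing are assumptions for illustration; the rebuttal does not specify how the measurement would be implemented.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities with max-subtraction for stability."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) with a small epsilon to avoid division by zero."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def max_kl(
    pre_logits: Sequence[Sequence[float]],   # per-position logits before tuning
    post_logits: Sequence[Sequence[float]],  # per-position logits after tuning
) -> float:
    """Worst-case KL divergence across positions of a held-out sequence."""
    return max(
        kl_divergence(softmax(a), softmax(b))
        for a, b in zip(pre_logits, post_logits)
    )
```

If the autoregressive path is truly untouched, `max_kl` on standard sequences should be zero up to numerical noise; any clearly positive value would indicate that tuning leaked into the target distribution.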
Circularity Check
No significant circularity; designs are independent engineering choices
The paper introduces FlexDraft through three explicit design components—Attention Tuning (tuning final-layer attention projectors on mask tokens while freezing the autoregressive path), Bonus-guided Calibration (MLP conditioned on resolved bonus token), and Flex Decoding (dynamic switching between parallel and sequential modes). These are presented as practical solutions to batch-size limitations and verification mismatch, with the lossless property asserted as a direct consequence of keeping the autoregressive path frozen. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described claims. The derivation chain consists of independent architectural decisions rather than reductions to inputs by construction, making the framework self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Attention Tuning enables block diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens, while keeping the autoregressive path frozen to preserve the target distribution
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.