Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting
Pith reviewed 2026-06-29 08:49 UTC · model grok-4.3
The pith
BASTION uses dynamic query-dependent trees to accelerate speculative decoding while respecting hardware budgets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BASTION dynamically constructs query-dependent trees for block-diffusion drafters by integrating an acceptance surrogate that estimates expected accepted length via path confidence, an online latency estimator that calibrates a hardware-aware roofline model, and an adaptive best-first expansion that grows the tree until marginal gains no longer justify incremental verification costs. This achieves up to a 6.61x speedup over standard autoregressive decoding and 39% over state-of-the-art block-diffusion baselines across diverse benchmarks and GPU architectures, while preserving the target model's distribution and requiring no per-setting tuning.
What carries the argument
The adaptive best-first expansion that grows the tree until marginal gains no longer justify incremental verification costs, using estimates from the acceptance surrogate and latency estimator.
Load-bearing premise
The acceptance surrogate and online latency estimator provide sufficiently accurate estimates of expected accepted length and verification cost to guide tree expansion without per-setting tuning or post-hoc adjustment.
What would settle it
A measurement on a new model or GPU where actual accepted token counts and verification times deviate enough from the surrogate estimates that the adaptive expansion selects trees with lower net speedup than a static baseline.
Figures
read the original abstract
Block-diffusion drafters have recently emerged as a powerful alternative for speculative decoding by predicting multiple future-token distributions in a single parallel step. However, since these parallel predictions are sampled from position-wise marginals rather than fully conditioned sequences, committing to a single greedy path often fails to capture the target model's preferred trajectory. To address this, we propose BASTION, a budget-aware speculative decoding framework with tree-based diffusion drafting. Unlike existing methods that rely on static tree topologies, BASTION dynamically constructs query-dependent trees by balancing draft quality against hardware constraints. Our framework integrates three synergistic components: (1) an acceptance surrogate that estimates expected accepted length via path confidence, (2) an online latency estimator that calibrates a hardware-aware roofline model, and (3) an adaptive best-first expansion that grows the tree until marginal gains no longer justify incremental verification costs. BASTION is training-free, preserves the target model's distribution, and requires no per-setting tuning. Across diverse benchmarks and GPU architectures, BASTION achieves up to a 6.61x speedup over standard autoregressive decoding, outperforming state-of-the-art block-diffusion baselines by 39%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BASTION, a budget-aware speculative decoding framework for large language models that employs tree-structured block diffusion drafting. Unlike static tree topologies in prior block-diffusion methods, BASTION dynamically constructs query-dependent trees via three components: (1) an acceptance surrogate estimating expected accepted length from path confidence, (2) an online latency estimator based on a hardware-aware roofline model, and (3) adaptive best-first expansion that terminates when marginal verification cost exceeds expected gain. The method is presented as training-free, distribution-preserving, and free of per-setting tuning. Empirical claims include up to 6.61× speedup versus standard autoregressive decoding and a 39% improvement over state-of-the-art block-diffusion baselines across diverse benchmarks and GPU architectures.
Significance. If the reported speedups are robustly demonstrated and the dynamic tree construction generalizes without hidden tuning, the work could meaningfully advance speculative decoding by addressing the mismatch between position-wise marginal predictions and target-model trajectories through hardware-aware, query-dependent trees. The training-free and tuning-free design is a notable strength relative to learned drafters.
major comments (1)
- [Abstract] Abstract: the central claim that the acceptance surrogate and online latency estimator enable tuning-free operation without per-setting adjustment is load-bearing for the 'budget-aware' and 'no per-setting tuning' assertions; the provided description does not detail validation of estimator accuracy across model scales or hardware, leaving open whether the 6.61× speedup generalizes or requires implicit calibration.
minor comments (1)
- The abstract would be strengthened by naming the specific benchmarks, model sizes, and GPU architectures used to obtain the 6.61× and 39% figures.
Simulated Author's Rebuttal
We thank the referee for highlighting the need to substantiate the tuning-free claims in the abstract. The comment correctly identifies that the abstract's brevity leaves the generalization of the estimators under-specified. We address this directly below and will revise the abstract accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that the acceptance surrogate and online latency estimator enable tuning-free operation without per-setting adjustment is load-bearing for the 'budget-aware' and 'no per-setting tuning' assertions; the provided description does not detail validation of estimator accuracy across model scales or hardware, leaving open whether the 6.61× speedup generalizes or requires implicit calibration.
Authors: We agree that the abstract does not provide sufficient detail on estimator validation. The full manuscript (Section 4.2, Figures 4-6, and Appendix C) reports results across model scales (7B-70B) and GPU architectures (A100, H100, RTX 4090) with no per-setting hyperparameter changes; the acceptance surrogate uses only path-wise confidence scores from the drafter, and the latency estimator performs online roofline calibration from a single forward pass. No implicit calibration or per-benchmark tuning is applied. To make this explicit, we will revise the abstract to include a concise clause noting cross-scale and cross-hardware validation without tuning. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper describes an empirical, training-free framework for dynamic tree construction in speculative decoding using an acceptance surrogate (path confidence), online latency roofline estimator, and adaptive best-first expansion. No equations, fitted parameters, or self-referential definitions are presented that would reduce the claimed speedups or components to tautologies by construction. The central claims rest on empirical validation across benchmarks rather than internal derivations that loop back to inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior work are visible in the abstract or high-level description that would trigger any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
gpt-oss-120b & gpt-oss-20b Model Card
Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, and Emad Barsoum. Pard: Accelerating llm inference with low-cost parallel draft model adaptation.arXiv preprint arXiv:2504.18583, 2025. 3, 15
-
[3]
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Sub- ham Sekhar Sahoo, and V olodymyr Kuleshov. Block diffusion: Interpolating between autore- gressive and diffusion language models.arXiv preprint arXiv:2503.09573, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021. 24
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[5]
Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali Thabet, and Jonas Kohler. Judge decoding: Faster speculative sampling requires going beyond model alignment.arXiv preprint arXiv:2501.19309,
-
[6]
Fast and robust early-exiting framework for autoregressive language models with synchronized parallel decoding
Sangmin Bae, Jongwoo Ko, Hwanjun Song, and Se-Young Yun. Fast and robust early-exiting framework for autoregressive language models with synchronized parallel decoding. InPro- ceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5910–5924, 2023. 2, 3
2023
-
[7]
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding, 2024. URL https://arxiv. org/abs/2308.14508. 24
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024. 2, 3, 15
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
Accelerating Large Language Model Decoding with Speculative Sampling
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023. 2, 3, 15
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
DFlash: Block Diffusion for Flash Speculative Decoding
Jian Chen, Yesheng Liang, and Zhijian Liu. Dflash: Block diffusion for flash speculative decoding.arXiv preprint arXiv:2602.06036, 2026. 1, 2, 3, 8, 15, 24
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[11]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
2021
-
[12]
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding, July
Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding, July
-
[13]
URLhttp://arxiv.org/abs/2402.12374. arXiv:2402.12374 [cs]. 3, 15
-
[14]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/ abs/2110.14168. 24 10
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[15]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors,Proceedings of the 2021 C...
-
[17]
Deepseek-v4: Towards highly efficient million-token context intelligence, 2026
DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026. 2
2026
-
[18]
Bert: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019. 3
2019
-
[19]
Layerskip: Enabling early exit inference and self-speculative decoding
Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. Layerskip: Enabling early exit inference and self-speculative decoding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12622–12642,
-
[20]
Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model
Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Anna Korhonen, David Traum, and Lluís Màrquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084, Florence, Italy, July 20...
-
[21]
Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2402.02057, 2024. 3
-
[22]
SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization
Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, and Fei Liu, editors,Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70–79, Hong Kong, China, November 2019. Association for Computa...
-
[23]
Better & Faster Large Language Models via Multi-token Prediction
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Syn- naeve. Better & faster large language models via multi-token prediction.arXiv preprint arXiv:2404.19737, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 2, 24
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Non-Autoregressive Neural Machine Translation
Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. Non- autoregressive neural machine translation.arXiv preprint arXiv:1711.02281, 2017. 2
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[26]
Yue Guan, Changming Yu, Shihan Fang, Weiming Hu, Zaifeng Pan, Zheng Wang, Zihan Liu, Yangjie Zhou, Yufei Ding, Minyi Guo, and Jingwen Leng. Yggdrasil: Bridging dynamic speculation and static runtime for latency-optimal tree-based llm decoding, 2025. URL https: //arxiv.org/abs/2512.23858. 15 11
-
[27]
Ssd-lm: Semi-autoregressive simplex- based diffusion language model for text generation and modular control
Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. Ssd-lm: Semi-autoregressive simplex- based diffusion language model for text generation and modular control. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11575–11596, 2023. 2
2023
-
[28]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022. 3
2022
-
[29]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021. 24
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[30]
Kaixuan Huang, Xudong Guo, and Mengdi Wang. Specdec++: Boosting speculative decoding via adaptive candidate lengths.arXiv preprint arXiv:2405.19715, 2024. 15
-
[31]
Efficient attentions for long document summarization
Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long document summarization. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors,Proceedings of the 2021 Conference of the North American Chapter of th...
-
[32]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024. 24
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Y ., Jung, Y ., Yun, J., Kundu, S., Kim, S.-Y ., and Yang, E
Doohyuk Jang, Sihwan Park, June Yong Yang, Yeonsung Jung, Jihun Yun, Souvik Kundu, Sung-Yub Kim, and Eunho Yang. Lantern: Accelerating visual autoregressive models with relaxed speculative decoding.arXiv preprint arXiv:2410.03355, 2024. 2
-
[34]
TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Regina Barzilay and Min-Yen Kan, editors,Proceedings of the 55th Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July
-
[35]
T rivia QA : A large scale distantly supervised challenge dataset for reading comprehension
Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147/. 24
-
[36]
Speculative decoding with big little decoder.Advances in Neural Information Processing Systems, 36:39236–39256, 2023
Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W Mahoney, Amir Gholami, and Kurt Keutzer. Speculative decoding with big little decoder.Advances in Neural Information Processing Systems, 36:39236–39256, 2023. 2, 3, 15
2023
-
[37]
Multi-Token Prediction via Self-Distillation
John Kirchenbauer, Abhimanyu Hans, Brian Bartoldson, Micah Goldblum, Ashwinee Panda, and Tom Goldstein. Multi-token prediction via self-distillation.arXiv preprint arXiv:2602.06019,
work page internal anchor Pith review Pith/arXiv arXiv
-
[38]
Fast inference from transformers via speculative decoding
Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023. 2, 3, 15
2023
-
[39]
Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, and Jun Wang. Diffuspec: Unlocking diffusion language models for speculative decoding.arXiv preprint arXiv:2510.02358, 2025. 3, 15
-
[40]
Diffusion-lm improves controllable text generation.Advances in neural information processing systems, 35:4328–4343, 2022
Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation.Advances in neural information processing systems, 35:4328–4343, 2022. 2
2022
-
[41]
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty.arXiv preprint arXiv:2401.15077, 2024. 2, 3, 15 12
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
Eagle-2: Faster inference of language models with dynamic draft trees, 2024
Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees, 2024. URL https://arxiv.org/abs/2406.168
2024
-
[43]
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840,
work page internal anchor Pith review Pith/arXiv arXiv
-
[44]
Tidar: Think in diffusion, talk in autoregression.arXiv preprint arXiv:2511.08923, 2025
Jingyu Liu, Xin Dong, Zhifan Ye, Rishabh Mehta, Yonggan Fu, Vartika Singh, Jan Kautz, Ce Zhang, and Pavlo Molchanov. Tidar: Think in diffusion, talk in autoregression.arXiv preprint arXiv:2511.08923, 2025. 2, 3, 15
-
[45]
Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, and Xiao Sun. Pearl: Parallel speculative decoding with adaptive draft length.arXiv preprint arXiv:2408.11850, 2024. 15
-
[46]
LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun, and Xiaoyan Sun. Logitspec: Accel- erating retrieval-based speculative decoding via next next token speculation.arXiv preprint arXiv:2507.01449, 2025. 15
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[47]
Specinfer: Accelerating large language model serving with tree-based speculative inference and verification
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Lang...
2024
-
[48]
Large Language Diffusion Models
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[49]
Sihwan Park, Doohyuk Jang, Sungyub Kim, Souvik Kundu, and Eunho Yang. Lantern++: Enhancing relaxed speculative decoding with static tree drafting for visual auto-regressive models.arXiv preprint arXiv:2502.06352, 2025. 2
-
[50]
Accelerating Speculative Decoding with Block Diffusion Draft Trees
Liran Ringel and Yaniv Romano. Accelerating speculative decoding with block diffusion draft trees.arXiv preprint arXiv:2604.12989, 2026. 15
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[51]
Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen, Vashisth Tiwari, Ruihang Lai, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Tianqi Chen, and Beidi Chen. Magicdec: Breaking the latency- throughput tradeoff for long context generation with speculative decoding.arXiv preprint arXiv:2408.11049, 2024. 2, 3
-
[52]
Mohammad Samragh, Arnav Kundu, David Harrison, Kumari Nishu, Devang Naik, Minsik Cho, and Mehrdad Farajtabar. Your llm knows the future: Uncovering its multi-token prediction potential.arXiv preprint arXiv:2507.11851, 2025. 2, 3, 15
-
[53]
Prompt lookup decoding, November 2023
Apoorv Saxena. Prompt lookup decoding, November 2023. URL https://github.com/apo orvumang/prompt-lookup-decoding/. 3
2023
-
[54]
Spectr: Fast speculative decoding via optimal transport.Advances in Neural Information Processing Systems, 36:30222–30242, 2023
Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. Spectr: Fast speculative decoding via optimal transport.Advances in Neural Information Processing Systems, 36:30222–30242, 2023. 15
2023
-
[55]
Hashimoto
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. 24
2023
-
[56]
Hunyuan AI Infra Team. Angelslim: A more accessible, comprehensive, and efficient toolkit for large model compression.arXiv preprint arXiv:2602.21233, 2026. 24
-
[57]
Opt-tree: Speculative decoding with adaptive draft tree structure.Transactions of the Association for Computational Linguistics, 13:188–199, 2025
Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, and Min Zhang. Opt-tree: Speculative decoding with adaptive draft tree structure.Transactions of the Association for Computational Linguistics, 13:188–199, 2025. 15 13
2025
-
[58]
Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025a
Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025. 2
-
[59]
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding.arXiv preprint arXiv:2505.22618, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[60]
Ar-diffusion: Auto-regressive diffusion model for text generation
Tong Wu, Zhihao Fan, Xiao Liu, Hai-Tao Zheng, Yeyun Gong, Jian Jiao, Juntao Li, Jian Guo, Nan Duan, Weizhu Chen, et al. Ar-diffusion: Auto-regressive diffusion model for text generation. Advances in Neural Information Processing Systems, 36:39957–39974, 2023. 2
2023
-
[61]
Stree: Speculative tree decoding for hybrid state-space models.arXiv preprint arXiv:2505.14969, 2025
Yangchao Wu, Zongyue Qin, Alex Wong, and Stefano Soatto. Stree: Speculative tree decoding for hybrid state-space models.arXiv preprint arXiv:2505.14969, 2025. 15
-
[62]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 2, 24
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[63]
Draft& verify: Lossless large language model acceleration via self-speculative decoding
Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra. Draft& verify: Lossless large language model acceleration via self-speculative decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11263–11282, 2024. 2
2024
-
[64]
Situo Zhang, Hankun Wang, Da Ma, Zichen Zhu, Lu Chen, Kunyao Lan, and Kai Yu. Adaeagle: Optimizing speculative decoding via explicit modeling of adaptive draft structures.arXiv preprint arXiv:2412.18910, 2024. 15
-
[65]
American invitational mathematics examination (aime) 2025,
Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025,
2025
-
[66]
Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023. 24 14 A Limitations There are two limitations in our work: • Batch size constraints:Our eval...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.