
arxiv: 2606.02544 · v1 · pith:EX6ERCVInew · submitted 2026-06-01 · 💻 cs.CL · cs.AI

SimSD: Simple Speculative Decoding in Diffusion Language Models

Pith reviewed 2026-06-28 14:55 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords speculative decoding · diffusion language models · masked language modeling · inference acceleration · attention masking · plug-and-play method · token verification

The pith

A plug-and-play masking strategy makes speculative decoding work with diffusion language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion language models decode tokens in parallel using masks but their bidirectional attention and changing contexts prevent direct use of speculative decoding, which relies on verifying multiple drafted tokens in one pass. SimSD adds reference tokens from a draft model and applies a custom attention mask so the target model can compute valid logits for those tokens without retraining. The approach keeps the original denoising distribution intact while restoring the verification step that causal masking provides in autoregressive models. If the method holds, it delivers large speedups on existing dLLM architectures and combines with other accelerations such as KV caching. Experiments on SDAR-family models across four benchmarks report up to 7.46 times higher throughput with no drop in average generation quality.
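The verification step that speculative decoding relies on can be sketched for the simplest case, greedy (temperature 0) decoding in an autoregressive model. This is an illustration of the general technique, not code from the paper: the names `verify_greedy` and `target_logits` are hypothetical, and the sketch assumes the target model has already produced one row of logits per drafted position in a single forward pass.

```python
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def verify_greedy(draft_tokens: List[int],
                  target_logits: List[List[float]]) -> List[int]:
    """Greedy speculative verification (illustrative sketch).

    target_logits[i] is assumed to be the target model's logits for the
    position of draft_tokens[i], conditioned on the prefix plus
    draft_tokens[:i]. In an AR model the causal mask makes all rows
    available from one forward pass.
    """
    accepted: List[int] = []
    for token, logits in zip(draft_tokens, target_logits):
        best = argmax(logits)
        if best == token:
            # The draft agrees with the target's own greedy choice.
            accepted.append(token)
        else:
            # First disagreement: keep the target's token and stop.
            accepted.append(best)
            break
    return accepted


# Toy example with a 4-token vocabulary: the third drafted token is rejected.
logits = [
    [0.1, 0.2, 0.9, 0.0],  # target prefers token 2
    [0.8, 0.1, 0.0, 0.1],  # target prefers token 0
    [0.0, 0.1, 0.2, 0.7],  # target prefers token 3
]
result = verify_greedy([2, 0, 1], logits)
```

The point of the sketch is the precondition in the docstring: each row of logits must be conditioned only on earlier tokens. That is what bidirectional attention over a changing masked context does not provide by default, and what the paper's masking strategy is meant to restore.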

Core claim

By explicitly introducing reference tokens from draft-model predictions and designing an attention mask that regulates their interaction with current-step tokens, dLLMs can compute valid logits for drafted tokens in a single forward pass. This restores the key verification ability provided by causal masking in AR models while preserving the parallel decoding advantages of dLLMs. The method is training-free and integrates with KV cache and blockwise decoding.

What carries the argument

The plug-and-play masking strategy that introduces reference tokens from draft predictions and applies an attention mask to produce temporally valid token-level contexts for verification.
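One way to picture a mask that yields temporally valid contexts is a step map, where each position carries the index of the decoding step at which it becomes visible. The sketch below is an assumption made for illustration only: this page does not give the paper's actual masking rules, and the rule used here (a query may attend to a key whose step is not later than its own) and the name `temporal_mask` are hypothetical.

```python
from typing import List


def temporal_mask(step_map: List[int]) -> List[List[bool]]:
    """Build a boolean attention mask from a per-token step map.

    mask[i][j] is True when query position i may attend to key position j.
    Assumed rule (hypothetical, not taken from the paper): a token sees
    every token whose step index is less than or equal to its own.
    """
    n = len(step_map)
    return [[step_map[j] <= step_map[i] for j in range(n)] for i in range(n)]


# Strictly increasing steps reduce to the usual causal (lower-triangular) mask.
causal_like = temporal_mask([0, 1, 2])

# Tokens that share a step attend to each other bidirectionally.
block_like = temporal_mask([0, 1, 1])
```

Under this assumed rule the two familiar regimes fall out as special cases: one token per step gives causal attention, and many tokens per step gives bidirectional attention within the step.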

If this is right

  • SimSD works without any model retraining or fine-tuning.
  • It integrates directly with KV cache and blockwise decoding for further gains.
  • It produces up to 7.46x higher decoding throughput on SDAR-family dLLMs.
  • Average generation quality is maintained or improved across four benchmarks.
  • The same masking idea bridges the incompatibility between masked bidirectional attention and token-level speculative verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The masking approach could extend to other non-causal generative models that lack built-in causal context.
  • If the logits remain unbiased, the technique may reduce the speed gap between diffusion and autoregressive language models in latency-sensitive settings.
  • Attention masks that simulate causal properties without retraining might apply to additional bidirectional diffusion tasks beyond text.

Load-bearing premise

The introduced reference tokens and attention mask produce logits for drafted tokens that are valid for verification without changing the underlying diffusion denoising distribution or introducing systematic bias in the accepted tokens.

What would settle it

Generate the same prompts with standard dLLM decoding and with the SimSD mask; if the distribution of accepted tokens or final outputs shows systematic differences, the logits are not valid for unbiased verification.
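A minimal sketch of how such a comparison could be scored, assuming outputs (or accepted tokens) have been collected from both decoders on the same prompts. The function name `total_variation` and the toy samples are hypothetical, and total variation distance is one possible choice of metric, not one the paper is reported to use.

```python
from collections import Counter
from typing import Hashable, Sequence


def total_variation(samples_a: Sequence[Hashable],
                    samples_b: Sequence[Hashable]) -> float:
    """Total variation distance between two empirical distributions.

    Returns 0.0 when the observed frequencies match exactly and 1.0 when
    the two sample sets share no outcomes.
    """
    if not samples_a or not samples_b:
        raise ValueError("both sample sets must be non-empty")
    counts_a, counts_b = Counter(samples_a), Counter(samples_b)
    n_a, n_b = len(samples_a), len(samples_b)
    support = set(counts_a) | set(counts_b)
    return 0.5 * sum(abs(counts_a[x] / n_a - counts_b[x] / n_b) for x in support)


# Toy data (made up for illustration): tokens emitted at one position.
baseline_tokens = ["a", "a", "b", "b"]
simsd_tokens = ["a", "b", "b", "b"]
distance = total_variation(baseline_tokens, simsd_tokens)
```

With finite samples the distance is nonzero even when the underlying distributions are identical, so a fair reading compares it against the distance between two independent runs of the standard decoder.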

Figures

Figures reproduced from arXiv: 2606.02544 by Chaojie Ren, Haoru Li, Haotian Ye, Hongcan Guo, Jingbo Shang, Jinya Jiang, Junxia Cui, Kaijie Zhu, Kun Zhou, Runchu Tian, Yiming Huang, Zhongkai Yu.

Figure 1: SimSD restores token-level speculative decoding for diffusion language models. Vanilla …
Figure 2: Token-level temporal causal attention. (a) A step map assigns each token a temporal …
Figure 3: Self draft distribution analysis. We use SDAR-8B-Chat as both draft and target with B = 4 and temperature 0 …

Original abstract

Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster inference through parallel or blockwise decoding. However, their masked language modeling formulation remains incompatible with standard token-level speculative decoding, one of the most effective acceleration techniques for AR models. In AR decoding, the causal mask preserves temporally valid token-level contexts, enabling a target model to verify multiple drafted tokens in a single forward pass. In contrast, dLLMs rely on mask tokens and bidirectional attention, causing the effective context to change across denoising steps and preventing direct token-level speculative verification. To bridge this gap, we propose a simple but effective speculative decoding algorithm for diffusion language models, named SimSD, which mainly adopts a plug-and-play masking strategy that equips dLLMs with temporally valid token-level contexts for speculative decoding. Our method explicitly introduces reference tokens from draft-model predictions and designs an attention mask that regulates their interaction with current-step tokens, allowing dLLMs to compute valid logits for drafted tokens in a single forward pass. This restores the key verification ability provided by causal masking in AR models while preserving the parallel decoding advantages of dLLMs. The proposed method is training-free and can be flexibly integrated with other acceleration techniques such as KV cache and blockwise decoding. Experiments on SDAR-family dLLMs across four benchmarks show that our method achieves up to 7.46x higher decoding throughput while maintaining and even improving average generation quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SimSD, a training-free speculative decoding algorithm for diffusion language models (dLLMs). It introduces reference tokens from a draft model and a custom attention mask to create temporally valid token-level contexts, enabling verification of multiple drafted tokens in a single forward pass while preserving dLLMs' parallel decoding. Experiments on SDAR-family dLLMs across four benchmarks report up to 7.46x higher decoding throughput with maintained or improved generation quality. The method is presented as plug-and-play and compatible with techniques such as KV cache and blockwise decoding.

Significance. If the custom masking strategy is shown to preserve the original denoising distribution, the work would meaningfully extend speculative decoding to dLLMs, offering a practical acceleration path that combines verification efficiency with bidirectional parallelism. The reported speedups and training-free integration are potentially impactful for inference in non-autoregressive models, though the absence of a supporting invariance argument limits the strength of the central claim.

major comments (2)
  1. [Abstract / Methods] Abstract and methods description of the attention mask: the claim that the mask 'regulates their interaction with current-step tokens' to yield logits 'usable for verification' without altering the diffusion denoising distribution lacks any derivation, invariance argument, or proof that the modified bidirectional context remains distributionally equivalent to standard dLLM inference. This is load-bearing for both the correctness of accepted tokens and the quality-maintenance result.
  2. [Experiments] Experiments section: the 7.46x throughput and quality claims are reported without error bars, variance across runs, details on the exact masking rules per denoising step, or ablations isolating the reference-token placement, preventing verification that results are robust rather than benchmark-specific.
minor comments (2)
  1. [Abstract] The abstract would be clearer if it briefly stated the number of denoising steps, model sizes, and exact benchmark datasets used in the reported experiments.
  2. [Methods] Notation for 'reference tokens' and the precise form of the custom attention mask should be formalized with an equation or pseudocode early in the methods to aid reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key opportunities to strengthen the theoretical grounding and experimental reporting of SimSD. We address each major comment below and will make corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and methods description of the attention mask: the claim that the mask 'regulates their interaction with current-step tokens' to yield logits 'usable for verification' without altering the diffusion denoising distribution lacks any derivation, invariance argument, or proof that the modified bidirectional context remains distributionally equivalent to standard dLLM inference. This is load-bearing for both the correctness of accepted tokens and the quality-maintenance result.

    Authors: We agree that the manuscript would be strengthened by an explicit argument establishing distributional equivalence. The current presentation relies on the design of the mask to enforce temporally valid contexts for reference tokens while retaining bidirectional attention for the active denoising step, together with empirical evidence of preserved quality. In revision we will add a dedicated subsection deriving the equivalence: we will show that, for each drafted token, the mask restricts attention exclusively to tokens that have already been denoised in prior steps (preventing leakage of future information) and that the resulting conditional distribution over the verification logits is identical to the one obtained under standard dLLM inference at the corresponding noise level. revision: yes

  2. Referee: [Experiments] Experiments section: the 7.46x throughput and quality claims are reported without error bars, variance across runs, details on the exact masking rules per denoising step, or ablations isolating the reference-token placement, preventing verification that results are robust rather than benchmark-specific.

    Authors: We accept that additional statistical and methodological detail is needed. In the revised Experiments section we will report means and standard deviations over multiple independent runs, supply the precise per-step masking rules (including how reference tokens are positioned relative to the current denoising mask), and include ablations that vary reference-token placement. These additions will substantiate the robustness of the reported throughput gains and quality metrics across the four benchmarks. revision: yes
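The reporting requested in the second comment can be sketched as follows, assuming throughput is measured in tokens per second over several independent runs. The numbers are made up for illustration and are not results from the paper; `summarize` and `speedup` are hypothetical helper names.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(runs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-run measurements."""
    return mean(runs), stdev(runs)


def speedup(baseline_runs: Sequence[float],
            method_runs: Sequence[float]) -> float:
    """Ratio of mean throughputs (method over baseline)."""
    return mean(method_runs) / mean(baseline_runs)


# Hypothetical tokens-per-second measurements from three runs each.
baseline = [10.0, 12.0, 14.0]
method = [28.0, 30.0, 32.0]

baseline_mean, baseline_std = summarize(baseline)
method_mean, method_std = summarize(method)
ratio = speedup(baseline, method)
```

`statistics.stdev` computes the sample standard deviation and needs at least two runs; a single "up to" figure, by contrast, reports only the best case and says nothing about spread.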

Circularity Check

0 steps flagged

No circularity; claims rest on experimental results

Full rationale

The paper introduces a training-free masking strategy (SimSD) to enable speculative decoding for dLLMs and validates performance via benchmark experiments reporting up to 7.46x throughput with maintained quality. No equations, fitted parameters, self-citations, or derivations are presented that reduce any central claim to its own inputs by construction. The method is described as a plug-and-play attention mask modification whose correctness is asserted through empirical outcomes rather than a self-referential chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities are mentioned in the abstract; the method is presented as a training-free algorithmic change.

pith-pipeline@v0.9.1-grok · 5832 in / 1095 out tokens · 18104 ms · 2026-06-28T14:55:54.269499+00:00 · methodology

