arxiv: 2605.24832 · v1 · pith:KPC2PYCKnew · submitted 2026-05-24 · 💻 cs.DC

Optimus: Elastic Decoding for Efficient Diffusion LLM Serving

Pith reviewed 2026-06-30 00:13 UTC · model grok-4.3

classification 💻 cs.DC
keywords diffusion llm · elastic decoding · llm serving · chunked decoding · saturation-aware scheduling · gpu utilization · throughput optimization

The pith

Optimus makes diffusion LLM decoding block size a runtime variable that a closed-loop scheduler adjusts to current load, sustaining high throughput without model retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fixed-block diffusion decoding wastes GPU cycles under light load and does redundant work under heavy load, because no single block size suits both regimes. Optimus splits generation into smaller executable chunks and lets a scheduler pick the chunk size at each step based on measured saturation. The combination keeps hardware busy while limiting wasted tokens. The reported experiments show throughput staying high across load changes and exceeding both autoregressive and static diffusion baselines.

Core claim

Optimus enables elastic decoding for diffusion LLMs by combining chunked decoding, which permits fine-grained execution steps without retraining, with saturation-aware scheduling that selects chunk sizes from runtime observations. This treats decoding granularity as a controllable variable that trades off GPU utilization against token efficiency, yielding up to 6.1x higher throughput than autoregressive decoding and 4.3x higher than fixed-block diffusion while preserving output accuracy.
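To make the trade-off concrete, here is a minimal Python sketch of model-based chunk selection. It is not the paper's scheduler: the flat-then-linear latency curve, the square-root yield of committed tokens, the 16-token capacity, the candidate chunk sizes, and the names `step_latency`, `committed_tokens`, `throughput`, and `pick_chunk_size` are all assumptions chosen for illustration.

```python
from typing import Sequence

# Every constant and functional form below is an illustrative assumption,
# not a value or formula taken from the paper.
CANDIDATE_CHUNKS = (1, 2, 4, 8, 16, 32)


def step_latency(chunk_size: int, batch_size: int, capacity_tokens: int = 16) -> float:
    """Assumed latency per decoding step, in arbitrary units.

    Flat at 1.0 while chunk_size * batch_size fits within capacity_tokens,
    then growing linearly with the number of tokens in flight.
    """
    tokens_in_flight = chunk_size * batch_size
    return max(1.0, tokens_in_flight / capacity_tokens)


def committed_tokens(chunk_size: int) -> float:
    """Assumed yield per request per step: larger chunks commit more tokens,
    with diminishing returns (the square root is purely illustrative)."""
    return chunk_size ** 0.5


def throughput(chunk_size: int, batch_size: int, capacity_tokens: int = 16) -> float:
    """Committed tokens per unit time across the whole batch."""
    latency = step_latency(chunk_size, batch_size, capacity_tokens)
    return batch_size * committed_tokens(chunk_size) / latency


def pick_chunk_size(
    batch_size: int,
    capacity_tokens: int = 16,
    candidates: Sequence[int] = CANDIDATE_CHUNKS,
) -> int:
    """Return the candidate chunk size with the highest modeled throughput."""
    return max(candidates, key=lambda c: throughput(c, batch_size, capacity_tokens))


if __name__ == "__main__":
    for batch in (1, 4, 16, 32):
        print(batch, pick_chunk_size(batch))
```

Under these assumptions the selected chunk size shrinks as the batch grows, so the product of batch size and chunk size stays near the assumed capacity until the batch alone saturates it.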

What carries the argument

Chunked decoding with saturation-aware closed-loop scheduling that selects decoding granularity at runtime to balance utilization and redundant computation.

If this is right

  • Throughput improves by up to 6.1 times relative to autoregressive decoding.
  • Throughput improves by up to 4.3 times relative to fixed-block diffusion decoding.
  • Performance stays stable as offered load moves from low to high utilization.
  • End-to-end serving capacity rises under latency SLOs because idle time and redundant work both decrease.
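The latency-SLO claim can be checked with a small helper like the one sketched below. The P90 statistic and the 50 ms time-per-output-token (TPOT) target for chat workloads come from the paper text quoted in the figure captions; the nearest-rank percentile method, the TPOT definition as decode time divided by output tokens, and the function names are assumptions made for this example.

```python
from typing import Sequence


def tpot_ms(decode_time_ms: float, output_tokens: int) -> float:
    """Time per output token for one request (assumed definition)."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return decode_time_ms / output_tokens


def p90(values: Sequence[float]) -> float:
    """Nearest-rank 90th percentile, computed with integer arithmetic."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n)
    return ordered[rank - 1]


def meets_slo(tpots_ms: Sequence[float], slo_ms: float = 50.0) -> bool:
    """True when the P90 TPOT is within the SLO."""
    return p90(tpots_ms) <= slo_ms


if __name__ == "__main__":
    samples = [tpot_ms(4000.0, 100)] * 9 + [tpot_ms(8000.0, 100)]
    print(p90(samples), meets_slo(samples))
```

Serving capacity under an SLO is then the highest request rate at which `meets_slo` still holds for the measured requests.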

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same chunk-and-schedule pattern could extend to other iterative non-autoregressive generators that currently fix their step size.
  • Variable-granularity execution might reduce power draw in multi-tenant clusters by keeping GPUs closer to full utilization without extra tokens.
  • Only scheduler and kernel changes are needed, so the technique could be added to existing diffusion-LLM inference stacks without retraining.

Load-bearing premise

Runtime load can be measured accurately enough for the closed-loop scheduler to pick chunk sizes that improve utilization without adding overhead, and chunked decoding leaves the diffusion model's generated outputs unchanged.

What would settle it

Running the same diffusion model under rapidly varying load with fixed blocks versus Optimus and checking whether token-level accuracy or perplexity diverges when chunk sizes change.
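One way to run that comparison is to diff the token sequences produced by the two configurations. The sketch below is a hypothetical harness, not something the paper describes: `token_agreement` and `first_divergence` are illustrative names, and the inputs are assumed to be lists of token ids for the same prompt and seed.

```python
from typing import Optional, Sequence


def token_agreement(reference: Sequence[int], candidate: Sequence[int]) -> float:
    """Fraction of positions where the two sequences match.

    Positions present in only one sequence count as disagreements.
    """
    longest = max(len(reference), len(candidate))
    if longest == 0:
        return 1.0
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / longest


def first_divergence(reference: Sequence[int], candidate: Sequence[int]) -> Optional[int]:
    """Index of the first differing position, or None if the sequences are identical."""
    for index, (r, c) in enumerate(zip(reference, candidate)):
        if r != c:
            return index
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None


if __name__ == "__main__":
    fixed_block = [11, 42, 7, 99]
    elastic = [11, 42, 8, 99]
    print(token_agreement(fixed_block, elastic), first_divergence(fixed_block, elastic))
```

Exact agreement is a strict test; if outputs are not expected to be bit-identical, task accuracy or perplexity on a held-out set would be the fallback comparison.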

Figures

Figures reproduced from arXiv: 2605.24832 by Bowen Duan, Chiyue Wei, Cong Guo, Danyang Zhuo, Hai "Helen" Li, Haoxuan Shan, Junyao Zhang, Yangjie Zhou, Yifei Wang, Yiran Chen.

Figure 1. Load sensitivity of fixed-granularity decoding.
Figure 2. Comparison between autoregressive decoding (left) and block diffusion decoding (right). AR generates one token per step, while diffusion-style decoding generates multiple tokens per iteration, batch size 1. These results demonstrate that Optimus makes diffusion LLM serving practical under dynamic workloads and provides a system-level foundation that may inform future algorithm design. This study makes th…
Figure 3. Motivation for saturation-aware decoding. (a) AR underutilizes the GPU. (b) Diffusion improves utilization but suffers under high load. (c) Granularity trades off GPU and token utilization. (d) The optimal point tracks the saturation boundary. Experiments use Qwen3-8B (AR) and SDAR-8B (multiple block sizes, from 2 to 32) on an A100 80GB. … more compute-dense executions that inherently align with the throughp…
Figure 4. Chunked decoding overview. (a) Block-wise diffusion decoding. (b) Prefix caching removes prefix dependency. (c) Suffix chunking enables fine-grained execution. (d) Streaming chunked decoding restores execution order and efficiency. … exceeds a threshold are committed (Decoding → Decoded), while others remain masked. Although decoding tokens produce tentative outputs, their KV states are computed from masked …
Figure 5. GPU latency and the committed token modeling. … the current batch size b obtained from continuous batching. We next describe the modeling of each component in detail. 5.2 Modeling System Efficiency. We model the decoding latency T_latency(c, b) based on the dominant GPU workload. In practice, latency is primarily determined by the fully connected (FC) layers, whose computation scales with the total number of…
Figure 6. Scheduling comparison under a GPU capacity of 16 tokens (1 step = 1s under underload). Top: AR underutilizes the GPU. Middle: elastic decoding operates near saturation. Bottom: BD overloads the GPU. Elastic scheduling balances throughput and latency across workloads. As illustrated in …
Figure 7. Model accuracy on common LLM benchmarks. … request arrival traces using a Poisson arrival process. We set service-level objectives (SLOs) based on application requirements, following common practice in prior LLM serving work [50]. For interactive chat workloads (ShareGPT and LMSYS-Chat-1M), we adopt a relatively stringent 50ms time-per-output-token (TPOT) SLO, as this is generally perceived as instantaneous…
Figure 8. Throughput scaling with batch size. (a) Different chunk sizes exhibit a load-dependent trade-off and form a Pareto frontier; (b) Optimus adapts chunk size at runtime and outperforms AR and fixed-block baselines. … batches at every decoding iteration, enabling more flexible continuous batching and improved resource utilization. LMDeploy also provides an optimized block-wise paged-attention kernel, making it …
Figure 9. Throughput comparison across batch sizes. 7.4 End-to-End Serving Performance. We next evaluate Optimus in an end-to-end online serving setting, focusing on decoding performance since the prefill behavior is largely identical across methods. Following prior work [50, 54], we measure the P90 TPOT under gradually increasing request rates in …
Figure 10. End-to-end online serving evaluation: P90 TPOT over different request rates. [Plot residue: batch size histogram at req/s = 0.5 (mean 1.8, median 1.0); batch size histogram at req/s = 4.9 (mean 25.0, median 23.0); chunk size histogram (mean 32.0, median 32.0) …]
Figure 12. Throughput across model scales and tensor-parallel settings. … decoding with block size 32 (BD32). We then enable chunked decoding along with fixed chunk sizes, without elastic scheduling, to isolate the effect of chunk granularity. Finally, we evaluate the full Optimus design, which combines chunked decoding with runtime elastic scheduling. We report P90 TPOT on ShareGPT using SDAR-8B in …
Figure 13. Ablation of chunked decoding and elastic scheduling. 8 Related Works. LLM Inference Serving. Prior work on LLM serving improves efficiency through scheduling, memory management, and system design within the autoregressive decoding paradigm. Orca [46] introduces iteration-level scheduling for dynamic batching. vLLM [22] improves throughput via PagedAttention and efficient KV-cache management. SGLang [49…
Original abstract

Large language model (LLM) serving is fundamentally limited by inefficient hardware utilization. Autoregressive (AR) decoding underutilizes GPUs due to its strictly sequential execution, while diffusion LLMs (DLLMs) improve throughput by decoding multiple tokens per iteration. However, fixed block-size diffusion decoding exhibits strong load sensitivity: large blocks exploit idle GPU resources under low load, but saturate early and incur substantial redundant computation under high load. As a result, throughput gains vanish beyond saturation, and no single decoding granularity performs well across dynamic serving workloads. We present Optimus, a serving system that enables elastic decoding for diffusion LLMs by dynamically adapting decoding granularity to runtime load. The key idea is to treat decoding granularity as a runtime control variable, balancing GPU utilization and token efficiency. Optimus combines chunked decoding, which enables fine-grained execution without retraining, with saturation-aware scheduling, a closed-loop mechanism that selects chunk sizes based on runtime conditions. Together with system-level optimizations and customized attention kernels, Optimus achieves significant performance improvements while preserving model accuracy. Experiments show that Optimus delivers up to 6.1x throughput improvement over AR decoding and 4.3x improvement over fixed-block diffusion LLM, while maintaining stable performance across diverse load regimes and improving end-to-end serving capacity under latency constraints. The source code is available at https://github.com/dubcyfor3/Optimus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents Optimus, a serving system for diffusion LLMs that enables elastic decoding by treating decoding granularity as a runtime control variable. It combines chunked decoding (to allow fine-grained execution without retraining) with a saturation-aware closed-loop scheduler that selects chunk sizes based on monitored runtime load, plus system optimizations and custom attention kernels. The central claim is that this yields up to 6.1× throughput over autoregressive decoding and 4.3× over fixed-block diffusion decoding while preserving accuracy and delivering stable performance across load regimes.

Significance. If the empirical results hold under rigorous evaluation, the work would address a practical limitation in diffusion-LLM serving—load sensitivity of fixed block sizes—by providing a dynamic mechanism that improves GPU utilization without model changes. The open-source release is a positive factor for reproducibility in systems research.

major comments (2)
  1. [Abstract] Abstract: the headline throughput claims (6.1× vs. AR, 4.3× vs. fixed-block) are stated without any accompanying experimental details, baselines, error bars, accuracy metrics, load definitions, or hardware configuration, rendering the central empirical claim impossible to evaluate from the supplied text.
  2. [Scheduling mechanism (inferred from abstract)] Saturation-aware scheduling description: the closed-loop mechanism is asserted to select chunk sizes from runtime load “without meaningful overhead” and while preserving diffusion behavior, yet no measurements of monitoring cost, decision latency, or oscillation risk under intra-iteration load fluctuations are supplied; these quantities are load-bearing for the claim that elastic decoding outperforms the fixed-block baseline that the paper itself shows saturates early.
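The oscillation risk raised in the second comment could be quantified from a log of the chunk sizes the scheduler chose at successive steps. The following is a minimal sketch under that assumption; the trace format and the name `count_reversals` are hypothetical and do not come from the paper or its code release.

```python
from typing import Sequence


def count_reversals(chunk_trace: Sequence[int]) -> int:
    """Count direction reversals in a chunk-size trace.

    A reversal is a grow step that follows a shrink step, or the opposite.
    Steps where the chunk size is unchanged are ignored.
    """
    reversals = 0
    last_direction = 0  # +1 for grow, -1 for shrink, 0 before any change
    for previous, current in zip(chunk_trace, chunk_trace[1:]):
        if current == previous:
            continue
        direction = 1 if current > previous else -1
        if last_direction != 0 and direction != last_direction:
            reversals += 1
        last_direction = direction
    return reversals


if __name__ == "__main__":
    print(count_reversals([8, 16, 8, 16, 8]))   # alternating trace
    print(count_reversals([32, 16, 8, 8, 4]))   # monotone trace
```

A high reversal count relative to trace length under steady load would suggest the controller is hunting around the saturation boundary rather than settling.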

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions to the manuscript.

  1. Referee: [Abstract] Abstract: the headline throughput claims (6.1× vs. AR, 4.3× vs. fixed-block) are stated without any accompanying experimental details, baselines, error bars, accuracy metrics, load definitions, or hardware configuration, rendering the central empirical claim impossible to evaluate from the supplied text.

    Authors: We agree that the abstract, constrained by length, omits the supporting experimental details. The full manuscript provides these in its experimental setup and evaluation sections (the end-to-end serving results are in Section 7.4), including baselines (AR decoding and fixed-block diffusion), hardware (NVIDIA A100 GPUs), load definitions (request rates), accuracy metrics, and results with variability reporting. To improve standalone readability of the abstract, we will revise it to briefly reference the evaluation conditions and hardware. revision: partial

  2. Referee: [Scheduling mechanism (inferred from abstract)] Saturation-aware scheduling description: the closed-loop mechanism is asserted to select chunk sizes from runtime load “without meaningful overhead” and while preserving diffusion behavior, yet no measurements of monitoring cost, decision latency, or oscillation risk under intra-iteration load fluctuations are supplied; these quantities are load-bearing for the claim that elastic decoding outperforms the fixed-block baseline that the paper itself shows saturates early.

    Authors: The scheduling section (its latency model is in Section 5.2) describes the saturation-aware closed-loop scheduler and its design for low overhead. We acknowledge that explicit measurements of monitoring cost, decision latency, and oscillation risk under fluctuating loads are not quantified in the current version. We will add these measurements (including overhead breakdowns and stability analysis) to the revised evaluation section to directly support the scheduler's claims relative to the fixed-block baseline. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical systems contribution with no load-bearing derivations or self-citation chains

The paper describes a serving system (Optimus) using chunked decoding and saturation-aware scheduling for diffusion LLMs. Throughput claims (6.1x vs AR, 4.3x vs fixed-block) are presented as experimental measurements under various loads, not as outputs of any mathematical derivation, fitted parameters, or uniqueness theorems. No equations, ansatzes, or self-citations are invoked to force the central results. The approach is self-contained as an engineering implementation with runtime monitoring and kernel optimizations; the reader's assessment of score 1.0 aligns with the absence of any reduction of claims to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; none can be extracted.

pith-pipeline@v0.9.1-grok · 5812 in / 1022 out tokens · 40039 ms · 2026-06-30T00:13:36.427222+00:00 · methodology


Reference graph

Works this paper leans on

57 extracted references · 16 canonical work pages · 7 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report

  2. [2]

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2025. Efficient LLM Inference via Chunked Prefills. SIGOPS Oper. Syst. Rev. 59, 1 (Aug. 2025), 9–16. doi:10.1145/3759441.3759444

  3. [3]

    anon8231489123. 2023. ShareGPT Vicuna Unfiltered Dataset. https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered

  4. [4]

    Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov

  5. [5]

    Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

    Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models. arXiv:2503.09573 [cs.LG] https://arxiv.org/abs/2503.09573

  6. [6]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models

  7. [7]

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. 2024. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers). 3119–3137

  8. [8]

    Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Lin Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Li...

  9. [9]

    Stuart K Card et al. 2018. The psychology of human-computer interaction. CRC Press, USA

  10. [10]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code

  11. [11]

    Xinhua Chen, Sitao Huang, Cong Guo, Chiyue Wei, Yintao He, Jianyi Zhang, Hai Li, Yiran Chen, et al. 2025. Dpad: Efficient diffusion language models with suffix dropout

  12. [12]

    Shuang Cheng, Yihan Bian, Dawei Liu, Linfeng Zhang, Qian Yao, Zhongbo Tian, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, and Bowen Zhou. 2025. SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation. arXiv:2510.06303 [cs.LG] https://arxiv.org/abs/2510.06303

  13. [13]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems, 2021

  14. [14]

    LMDeploy Contributors. 2023. LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM. https://github.com/InternLM/lmdeploy

  15. [15]

    Kamaluddeen Usman Danyaro, Maged Nasser, Abubakar Zakari, Shamsu Abdullahi, Atika Khanzada, Muhammad Muntasir Yakubu, Sara Shoaib, et al. 2025. LLM-Based Code Generation: A Systematic Literature Review With Technical and Demographic Insights. IEEE Access 13 (2025), 194915–194939.

  16. [16]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer

  17. [17]

    GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in neural information processing systems 35 (2022), 30318–30332

  18. [18]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

  19. [19]

    Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186

  20. [20]

    Shengyue Guan, Jindong Wang, Jiang Bian, Bin Zhu, Jian-Guang Lou, and Haoyi Xiong. 2026. Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey. ACM Trans. Intell. Syst. Technol. (Feb. 2026). doi:10.1145/3793671 Just Accepted

  21. [21]

    Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, et al. 2024. Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference

  22. [22]

    Yuezhou Hu, Jiaxin Guo, Xinyu Feng, and Tuo Zhao. 2025. AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders. arXiv:2510.19779 [cs.CL] https://arxiv.org/abs/2510.19779

  23. [23]

    Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S Abdelfattah, Jaesun Seo, Zhiru Zhang, and Udit Gupta. 2025. Accelerating diffusion language model inference via efficient kv caching and guided diffusion. arXiv–2505 pages

  24. [24]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.0...

  25. [25]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles. Association for Computing Machinery, New York, NY, USA, 611–626

  26. [26]

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning (Honolulu, Hawaii, USA) (ICML’23). JMLR.org, USA, Article 795, 13 pages

  27. [27]

    Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan, Li Shen, Yi Liang, Soroush Vosoughi, and Shiwei Liu. 2025. Diffusion Language Models Know the Answer Before Decoding

  28. [28]

    Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. 2022. Diffusion-lm improves controllable text generation. Advances in neural information processing systems 35 (2022), 4328–4343

  29. [29]

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. Eagle: Speculative sampling requires rethinking feature uncertainty

  30. [30]

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2025. Eagle-3: Scaling up inference acceleration of large language models via training-time test

  31. [31]

    Zikun Li, Zhuofu Chen, Remi Delacourt, Gabriele Oliaro, Zeyu Wang, Qinghan Chen, Shuhuai Lin, April Yang, Zhihao Zhang, Zhuoming Chen, Sean Lai, Xinhao Cheng, Xupeng Miao, and Zhihao Jia. 2025. AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding. arXiv:2501.12162 [cs.CL] https://arxiv.org/abs/2501.12162

  32. [32]

    Aiwei Liu, Minghua He, Shaoxun Zeng, Sijun Zhang, Linhao Zhang, Chuhan Wu, Wei Jia, Yuan Liu, Xiao Zhou, and Jie Zhou. 2025. Wedlm: Reconciling diffusion language models with standard causal attention for fast inference

  33. [33]

    Xiaoxuan Liu, Jongseok Park, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Chen Zhang, Kuntai Du, Xiangxi Mo, Kaichao You, Alvin Cheung, Zhijie Deng, Ion Stoica, and Hao Zhang. 2025. TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput. arXiv:2406.14066 [cs.AI]https://arxiv.org/abs/2406.14066

  34. [34]

    Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, and Linfeng Zhang. 2025. dllm-cache: Accelerating diffusion large language models with adaptive caching

  35. [35]

    Xinyin Ma, Runpeng Yu, Gongfan Fang, and Xinchao Wang. 2025. dkv-cache: The cache for diffusion language models

  36. [36]

    Mehran Nasseri, Patrick Brandtner, Robert Zimmermann, Taha Falatouri, Farzaneh Darbanian, and Tobechi Obinwanne. 2023. Applications of large language models (llms) in business analytics–exemplary use cases in data preparation tasks. In International conference on human-computer interaction. Springer, Springer-Verlag, Berlin, Heidelberg, 182–198

  37. [37]

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. 2025. Large Language Diffusion Models. arXiv:2502.09992 [cs.CL] https://arxiv.org/abs/2502.09992

  38. [38]

    NVIDIA. 2026. Matrix Multiplication Background User’s Guide. https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html

  39. [39]

    Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150 [cs.NE] https://arxiv.org/abs/1911.02150

  40. [40]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism

  41. [41]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models

  42. [42]

    Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, and Jun Zhou. 2025. Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models. arXiv:2507.17702 [cs.CL] https://arxiv.org/abs/2507.17702

  43. [43]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010

  44. [44]

    Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, and Zhijie Deng. 2025. Diffusion llms can do faster-than-ar inference via discrete diffusion forcing

  45. [45]

    Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (April 2009), 65–76. doi:10.1145/1498765.1498785

  46. [46]

    Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. 2025. Fast-dllm v2: Efficient block-diffusion llm

  47. [47]

    Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. 2025. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding

  48. [48]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, et al. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

  49. [49]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX symposium on operating systems design and implementation (OSDI 22). USENIX Association, USA, 521–538

  50. [50]

    Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. 2023. Are transformers effective for time series forecasting?. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence (AA...

  51. [51]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P Xing, et al. 2023. Lmsys-chat-1m: A large-scale real-world llm conversation dataset

  52. [52]

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems 37 (2024), 62557–62583

  53. [53]

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, USA, 193–210

  54. [54]

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (Santa Clara, CA, USA) (OSDI’24). USENIX Association, USA, Art...

  55. [55]

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023. Instruction-following evaluation for large language models

  56. [56]

    Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. 2025. LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models. arXiv:2505.19223 [cs.LG] https://arxiv.org/abs/2505.19223

  57. [57]

    Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, et al. 2025. NanoFlow: Towards optimal large language model serving throughput. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). USENIX Association, USA, 749–765.