pith. sign in

arxiv: 2606.21633 · v1 · pith:MZTA6KIOnew · submitted 2026-06-19 · 💻 cs.LG

HERALD: High-Throughput Block Diffusion LLM Serving via CPU-GPU Cooperative KV Cache Retrieval

Pith reviewed 2026-06-26 14:42 UTC · model grok-4.3

classification 💻 cs.LG
keywords diffusion LLMsKV cache offloadingblock diffusionCPU-GPU cooperationlong context inferencesparse attentionthroughput optimizationdenoising steps
0
0 comments X

The pith

HERALD lets block diffusion LLMs keep only 5-10 percent of the KV cache on GPU by picking the needed entries once per block and reusing them across denoising steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a KV offloading method for block diffusion LLMs that moves most of the growing cache to CPU memory while still delivering near-lossless accuracy on long-context tasks. It rests on the observation that the important KV entries stay stable inside each block of denoising steps, so the system selects the top entries once and reuses that choice for the whole block. Two engineering steps cut the selection cost by roughly the block size and let the selection work run in parallel with denoising on the GPU. Experiments across three models and five tasks show the approach matches full-cache accuracy at 5-10 percent GPU budget while cutting per-block latency by up to 1.59 times and raising throughput by up to 2.47 times, with larger gains at longer contexts.

Core claim

In block diffusion LLMs the relevant KV entries remain consistent across denoising steps within a block, so top-k selection performed once can be reused for all steps inside the block; HERALD exploits this by cutting the required selection compute by a factor of the block size and overlapping the remaining work with denoising through CPU-GPU cooperative retrieval, thereby supporting sparse offloading without accuracy loss.

What carries the argument

CPU-GPU cooperative KV cache retrieval that performs one-time top-k selection per block and overlaps the reduced selection work with denoising.

If this is right

  • Near-lossless accuracy holds at 5-10 percent KV budget on GPU across three block dLLMs and five long-context tasks.
  • Per-block latency drops by up to 1.59 times compared with GPU-only inference.
  • Throughput rises by up to 2.47 times, and the speedup grows as context length increases.
  • The selection overhead is amortized across the entire block rather than repeated each step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reuse pattern could be tested in other multi-step generation schemes where cache relevance is stable inside fixed windows.
  • Dynamic adjustment of the KV budget per block might further reduce memory use on tasks with varying attention sparsity.
  • The overlap technique could be combined with other offloading policies that already move less-critical layers to CPU.

Load-bearing premise

The relevant KV entries remain consistent across denoising steps within a block.

What would settle it

A measurement showing that the set of top-k KV entries changes substantially between successive denoising steps inside the same block, causing accuracy to drop when the once-selected set is reused.

Figures

Figures reproduced from arXiv: 2606.21633 by Doyeon Kim, Ion Stoica, Jae W. Lee, Jongseok Park, Omin Kwon, Seung Yul Lee.

Figure 1
Figure 1. Figure 1: Challenges HERALD addresses and its key ideas. consecutive tokens at a time, starting from 𝐵 MASK place￾holders that mark undetermined positions and refining them over 𝑇 (< 𝐵) denoising steps. Each step performs a single forward pass over all 𝐵 positions, unmasking multiple posi￾tions in parallel, so an entire block is decoded in only𝑇 steps, whereas AR decoding requires 𝐵 separate forward passes. The KV c… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of decoding processes. (a) AR LLM: Token-by-token generation per forward pass using standard causal attention. (b) Block Diffusion LLM: Block-wise par￾allel generation over multiple forward passes. Block causal attention mask enables bidirectional attention within the current block while attending to fixed cached tokens. KV selection tractable on the CPU and removing it from the critical path. •… view at source ↗
Figure 3
Figure 3. Figure 3: KV cache pressure and offloading overhead in LLaDA 2.0 (𝐵 = 64, 8K–32K context). (a) GPU memory break￾down showing KV cache scaling beyond the 80 GB limit of a single H100. (b) Latency of fetching the full KV cache from host DRAM, normalized to the model forward time. memory limit, the maximum achievable batch size must be significantly scaled back as context increases: it drops from 70 at 8K context to me… view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of Offloading Methods applied to Block Diffusion LLM cache from CPU on demand. SpeContext [44] employs a learned distilled language model as a retrieval head to de￾couple KV selection from LLM inference, and transfers only the delta between consecutive selections to reduce fetch vol￾ume. A broader line of sparse-attention and retrieval works— such as Quest [37], FreeKV [26], LouisKV [43], and Cl… view at source ↗
Figure 6
Figure 6. Figure 6: (a) Top-𝑘 recall of a single center [MASK] query vs. the full 𝐵-query union. (b) CPU attention speedup when reducing from 𝐵=32 queries to 𝑁 (8K–32K; measured on Intel Xeon Platinum 8481C). Reducing to 𝑁=1 yields 22– 24× speedup while retaining 84% top-𝑘 recall. critical path: steps 1–𝑇 must stall until selection completes, and selection dominates per-block latency. A natural remedy is to begin block 𝑁+1’s … view at source ↗
Figure 7
Figure 7. Figure 7: System overview of Herald. Left: the prefill phase appends a [MASK] block to the prompt, offloads KV states to CPU DRAM layer-wise, and performs top-𝑘 selection to bootstrap the sparse KV pool. Right: the decode phase operates two concurrent streams with double-buffered KV pools; the retrieval stream writes the next block’s entries to the idle pool while the main stream denoises from the active pool. Block… view at source ↗
Figure 8
Figure 8. Figure 8: Full pipeline of HERALD over two consecutive blocks. The main stream denoises block 𝑁 while the retrieval stream concurrently prepares block 𝑁+1’s Top-𝑘 entries through cooperative CPU–GPU execution, completing before denoising finishes. At the block boundary, the double-buffered KV pools are swapped and the pattern repeats for block 𝑁+1. selection over the prompt’s KV cache using its attention scores. The… view at source ↗
Figure 9
Figure 9. Figure 9: LongBench accuracy across all three models and five tasks. HERALD closely tracks the Dense baseline at 5–10% KV cache budget, while InfiniGen degrades substantially even at high budgets. MAGE is consistently robust across all settings. SXM 80 GB GPU (3.35 TB/s HBM3 bandwidth), connected via PCIe 5.0 ×16 (64GB/s). Models. We evaluate on three block diffusion LLMs: SDAR￾8B-Chat [5], TraDo-8B-Instruct [38], a… view at source ↗
Figure 10
Figure 10. Figure 10: Per-block decode latency (TPOB, ms) at batch sizes 𝐵=1, 16, 32, and 64 across context lengths. Columns correspond to models. Dense runs the full KV cache in HBM; all other methods offload KV to host DRAM. Dashed red bars indicate GPU out-of-memory. Dense. At the largest feasible batch sizes, HERALD is up to 8.9× faster than InfiniGen and 4.9× faster than MAGE. HERALD achieves Dense-level or even lower TPO… view at source ↗
Figure 11
Figure 11. Figure 11: Per-block latency breakdown of HERALD on LLaDA 2.0-mini 16B at two context lengths (8K, 32K) and three batch sizes (𝐵=16, 32, 64). and batch size grow. We believe this is because the CPU’s weaker compute is already fully utilized by the retrieval ker￾nel, so its latency scales linearly with the workload, while the GPU still has compute headroom and grows sublinearly. 6.5 Ablation Study To isolate each com… view at source ↗
read the original abstract

Diffusion LLMs (dLLMs) improve GPU utilization over autoregressive decoding by generating multiple tokens per forward pass, but their KV cache still grows linearly with context, limiting throughput at long contexts. KV cache offloading to host DRAM alleviates this memory pressure, but the limited PCIe bandwidth necessitates recalling only a sparse subset of KV entries. In block dLLMs, the relevant KV entries remain consistent across denoising steps within a block, enabling high-accuracy selection by identifying the top-k entries once and reusing them throughout all denoising steps. This property appears attractive for offloading as it amortizes the selection overhead across the entire block, but it requires exact attention over the full KV cache, which is too expensive under offloading. We present HERALD, a KV offloading system for block dLLMs that resolves this through two opportunities that reduce the required selection compute by a factor of the block size and enable selection to be overlapped with denoising. Across three block dLLMs and five long context tasks, HERALD achieves near-lossless accuracy at 5-10% KV budget and up to 1.59x lower per block latency and 2.47x higher throughput over GPU-only inference, with speedups growing with context length.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents HERALD, a KV offloading system for block diffusion LLMs (dLLMs) that exploits the consistency of relevant KV entries across denoising steps within a block to amortize sparse top-k selection from CPU DRAM to GPU. It claims this enables near-lossless accuracy at 5-10% KV budget across three block dLLMs and five long-context tasks, along with up to 1.59x lower per-block latency and 2.47x higher throughput versus GPU-only inference, with speedups increasing with context length. The system uses two (unspecified in the abstract) opportunities to reduce selection compute by the block-size factor and overlap it with denoising.

Significance. If the empirical claims hold under the stated KV-consistency assumption, the work would be significant for efficient long-context dLLM serving by demonstrating practical CPU-GPU cooperative offloading that avoids full-cache attention costs. The amortization of selection overhead and context-length scaling are potentially impactful contributions to the systems-for-ML literature.

major comments (2)
  1. [Abstract] Abstract (and § on KV selection): the near-lossless accuracy claim at 5-10% KV budget is load-bearing on the assumption that relevant KV entries remain sufficiently consistent across denoising steps within a block so that a single top-k selection can be reused; the manuscript must provide quantitative evidence (e.g., attention-score variance, per-step accuracy drop when reusing the selection, or sensitivity analysis) rather than stating the property, because any material violation would invalidate the reported accuracy and the downstream latency/throughput numbers.
  2. [Experiments] Experimental section (results on three dLLMs and five tasks): the reported 1.59x latency and 2.47x throughput improvements lack visible error bars, ablation data on the two selection-compute-reduction opportunities, or controls for PCIe bandwidth variation; without these, it is impossible to assess whether the speedups are robust or whether they depend on the unverified consistency assumption.
minor comments (2)
  1. [Abstract] The abstract should name the three block dLLMs and five tasks to allow readers to assess generality.
  2. [System Design] Notation for the two opportunities that cut selection compute by the block-size factor should be introduced with a short equation or pseudocode in the system-overview section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of the consistency assumption and experimental robustness. We address each point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract (and § on KV selection): the near-lossless accuracy claim at 5-10% KV budget is load-bearing on the assumption that relevant KV entries remain sufficiently consistent across denoising steps within a block so that a single top-k selection can be reused; the manuscript must provide quantitative evidence (e.g., attention-score variance, per-step accuracy drop when reusing the selection, or sensitivity analysis) rather than stating the property, because any material violation would invalidate the reported accuracy and the downstream latency/throughput numbers.

    Authors: We agree that the consistency property is central and that the manuscript currently relies on stating the observed behavior in block dLLMs without accompanying quantitative metrics. The accuracy results across tasks provide indirect support, but explicit evidence such as attention-score variance across steps or accuracy sensitivity to reuse would directly address the concern. We will add a dedicated analysis subsection with these measurements in the revision. revision: yes

  2. Referee: [Experiments] Experimental section (results on three dLLMs and five tasks): the reported 1.59x latency and 2.47x throughput improvements lack visible error bars, ablation data on the two selection-compute-reduction opportunities, or controls for PCIe bandwidth variation; without these, it is impossible to assess whether the speedups are robust or whether they depend on the unverified consistency assumption.

    Authors: The reported numbers are means over repeated runs on the evaluated hardware, but we acknowledge the absence of error bars, explicit ablations isolating the two compute-reduction opportunities, and PCIe bandwidth controls limits assessment of robustness. We will add error bars, ablation studies, and a discussion of bandwidth sensitivity in the revised experimental section. The accuracy results provide supporting evidence for the consistency assumption, but the additional controls will clarify its role. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical systems paper with no derivation chain

full rationale

The paper describes a KV offloading system for block dLLMs and reports empirical throughput/accuracy results. The central premise is an observed property (KV entry consistency across denoising steps within a block) that enables amortized top-k selection; this is presented as an input assumption rather than derived from any equation or prior result within the paper. No self-definitional loops, fitted parameters renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear. Performance numbers are direct measurements, not outputs of a closed mathematical chain. The contribution is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone.

pith-pipeline@v0.9.1-grok · 5773 in / 1006 out tokens · 15286 ms · 2026-06-26T14:42:07.583033+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

54 extracted references · 8 canonical work pages

  1. [1]

    Chiu, Zhihan Yang, Zhix- uan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov

    Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhix- uan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov

  2. [2]

    InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025

    Block Diffusion: Interpolating Between Autoregressive and Dif- fusion Language Models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net.https://openreview.net/forum?id=tyEyYT267x

  3. [3]

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhid- ian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguis- tics (Volume 1: Long Paper...

  4. [4]

    Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Lin Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Li...

  5. [5]

    Josue Caldas and Elvis de Souza. 2025. A Comprehensive Evaluation of Large Language Models for Retrieval-Augmented Generation under Noisy Conditions. InProceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025), Aman Sinha, Raúl Vázquez, Timothee Mickus, Rohit Agarwal, Ioana Buh...

  6. [6]

    Shuang Cheng, Yihan Bian, Dawei Liu, Linfeng Zhang, Qian Yao, Zhongbo Tian, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, and Bowen Zhou. 2025. SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation. arXiv:2510.06303 [cs.LG] https://arxiv.org/abs/2510.06303

  7. [7]

    Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

  8. [8]

    InProceedings of the 36th International Conference on Neural Information Processing Systems(New Orleans, LA, USA) (NIPS ’22)

    FLASHATTENTION: fast and memory-efficient exact attention with IO-awareness. InProceedings of the 36th International Conference on Neural Information Processing Systems(New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY, USA, Article 1189, 16 pages

  9. [9]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.109972, 1 (2023), 32

  10. [10]

    Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, Hao Peng, and Lingpeng Kong. 2025. Scaling Diffusion Language Models via Adaptation from Autoregressive Models. InThe Thirteenth Interna- tional Conference on Learning Representations.https://openreview.net/ forum?id=j1tSLYKwg8

  11. [11]

    Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. 2023. DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models. InThe Eleventh International Conference on Learning Representations.https://openreview.net/forum?id=jQj-_rLVXsj

  12. [12]

    Google DeepMind. 2025. Gemini Diffusion.https://deepmind.google/ models/gemini-diffusion/. Accessed: 2026-04-14

  13. [13]

    Yufeng Gu, Alireza Khadem, Sumanth Umesh, Ning Liang, Xavier Servot, Onur Mutlu, Ravi Iyer, and Reetuparna Das. 2025. Pim is all you need: A cxl-enabled gpu-free system for large language model inference. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 862–881

  14. [14]

    Mahoney, Sophia Shao, Kurt Keutzer, and Amir Gholami

    Coleman Richard Charles Hooper, Sehoon Kim, Hiva Moham- madzadeh, Michael W. Mahoney, Sophia Shao, Kurt Keutzer, and Amir Gholami. 2024. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems.https: //openreview.net/forum?id=0LXotew9Du

  15. [15]

    Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, and Minlan Yu. 2025. NEO: Saving GPU Memory Crisis with CPU Offload- ing for Online LLM Inference. InProceedings of Machine Learn- ing and Systems, M. Zaharia, G. Joshi, and Y. Lin (Eds.), Vol. 7. MLSys.https://proceedings.mlsys.org/paper_files/paper/2025/file/ 66a026c0d17040889b50f0dfa650e5e0-Paper-Conference.pdf

  16. [16]

    Hyungyo Kim, Nachuan Wang, Qirong Xia, Jinghan Huang, Amir Yazdanbakhsh, and Nam Sung Kim. 2025. LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading. InProceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA ’25). As- sociation for Computing Machinery, New York, NY,...

  17. [17]

    Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, and Sitan Chen. 2025. Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions. arXiv:2502.06768 [cs.LG] https://arxiv.org/abs/2502.06768

  18. [18]

    Lee, Sangdoo Yun, and Hyun Oh Song

    Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, and Hyun Oh Song. 2025. KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction. InThe Thirty-ninth An- nual Conference on Neural Information Processing Systems.https: //openreview.net/forum?id=JFygzwx8SJ

  19. [19]

    KyungSoo Kim, Omin Kwon, Yeonhong Park, and Jae W. Lee. 2025. AiDE: Attention-FFN Disaggregated Execution for Cost-Effective LLM Decoding on CXL-PNM .IEEE Computer Architecture Letters24, 02 (July 2025), 285–288.https://doi.org/10.1109/LCA.2025.3597323

  20. [20]

    Mahoney, Kurt Keutzer, and Amir Gholami

    Minseo Kim, Coleman Hooper, Aditya Tomar, Chenfeng Xu, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami

  21. [21]

    Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models.arXiv preprint arXiv:2510.04146(2025).https://arxiv.org/abs/2510.04146

  22. [22]

    Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, and Amir Gholami. 2026. CDLM: Consistency Diffusion Language Models For Faster Sampling. arXiv:2511.19269 [cs.LG]https://arxiv.org/abs/2511.19269

  23. [23]

    Omin Kwon, Yeonjae Kim, Doyeon Kim, Minseo Kim, Yeonhong Park, and Jae W. Lee. 2026. MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM. arXiv:2602.14209 [cs.LG]https: //arxiv.org/abs/2602.14209 12

  24. [24]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

  25. [25]

    Gonzalez and Hao Zhang and Ion Stoica , title =

    Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles(Koblenz, Germany)(SOSP ’23). As- sociation for Computing Machinery, New York, NY, USA, 611–626. https://doi.org/10.1145/3600006.3613165

  26. [26]

    Haeun Lee, Omin Kwon, Yeonhong Park, and Jae W. Lee. 2026. Nest- edFP: High-Performance, Memory-Efficient Dual-Precision Floating Point Support for LLMs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems.https://openreview.net/forum? id=WDAKFpWftI

  27. [27]

    Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: efficient generative inference of large language models with dynamic KV cache management. InProceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (Santa Clara, CA, USA)(OSDI’24). USENIX Association, USA, Article 9, 18 pages

  28. [28]

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192 [cs.LG]https://arxiv.org/abs/2211.17192

  29. [29]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval- Augmented Generation for Knowledge-Intensive NLP Tasks. InAd- vances in Neural Information Processing Systems, H. Larochelle, M. Ran- zato, R. Hadsell...

  30. [30]

    Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao, Dan- ning Ke, Minyi Guo, and Jieru Zhao. 2026. FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference. InThe Fourteenth Inter- national Conference on Learning Representations.https://openreview. net/forum?id=wXAn7orB1H

  31. [31]

    Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi Guo. 2025. ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression. In2025 62nd ACM/IEEE Design Automation Conference (DAC). 1–7.https://doi.org/10.1109/DAC63849. 2025.11132479

  32. [32]

    Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, and Linfeng Zhang. 2025. dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching. arXiv:2506.06295 [cs.LG]https://arxiv.org/abs/2506.06295

  33. [33]

    Aaron Lou, Chenlin Meng, and Stefano Ermon. 2024. Discrete dif- fusion modeling by estimating the ratios of the data distribution. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria)(ICML’24). JMLR.org, Article 1333, 30 pages

  34. [34]

    Xinyin Ma, Runpeng Yu, Gongfan Fang, and Xinchao Wang. 2025. dKV-Cache: The Cache for Diffusion Language Models. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=Gppo2JImHs

  35. [35]

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, JUN ZHOU, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. 2025. Large Language Diffusion Models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems.https://openreview.net/forum? id=KnqiC0znVF

  36. [36]

    Subham Sekhar Sahoo, Marianne Arriola, Aaron Gokaslan, Edgar Mar- iano Marroquin, Alexander M Rush, Yair Schiff, Justin T Chiu, and Volodymyr Kuleshov. 2024. Simple and Effective Masked Diffusion Language Models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems.https://openreview.net/forum?id= L4uaAR4ArM

  37. [37]

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. FlexGen: high-throughput generative inference of large language models with a single GPU. InProceedings of the 40th International Conference on Machine Learning(Honolulu, Hawaii, USA) (ICML’23). JMLR.org, Article 1288...

  38. [38]

    Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, and Xipeng Qiu. 2025. Sparse- dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction. arXiv:2508.02558 [cs.CL]https://arxiv.org/abs/2508.02558

  39. [39]

    Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. In2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1348–1362.https://doi.org/10.1109/hpca61900.2025.00102

  40. [40]

    Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. 2025. SHADOWKV: KV cache in shadows for high-throughput long-context LLM inference. InProceedings of the 42nd International Conference on Machine Learning (Vancouver, Canada)(ICML’25). JMLR.org, Article 2276, 19 pages

  41. [41]

    Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference. InForty-first International Conference on Machine Learning.https://openreview.net/forum?id=KzACYw0MTV

  42. [42]

    Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, and Mengdi Wang. 2025. Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models. arXiv:2509.06949 [cs.CL]https: //arxiv.org/abs/2509.06949

  43. [43]

    Zeqing Wang, Gongfan Fang, Xinyin Ma, Xingyi Yang, and Xinchao Wang. 2026. SparseD: Sparse Attention for Diffusion Language Models. InThe Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=dwbrZtYP04

  44. [44]

    Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Puyu Zeng, Yuxuan Wang, Biqing Qi, Dongrui Liu, and Linfeng Zhang. 2026. Accelerating Dif- fusion Large Language Models with SlowFast Sampling: The Three Golden Principles. InThe Fourteenth International Conference on Learn- ing Representations.https://openreview.net/forum?id=Uh17FiwF4q

  45. [45]

    Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie

  46. [46]

    InThe Four- teenth International Conference on Learning Representations.https: //openreview.net/forum?id=1NZ3DHF9nT

    Fast-dLLM v2: Efficient Block-Diffusion LLM. InThe Four- teenth International Conference on Learning Representations.https: //openreview.net/forum?id=1NZ3DHF9nT

  47. [47]

    Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. 2025. Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding. arXiv:2505.22618 [cs.CL]https://arxiv.org/ abs/2505.22618

  48. [48]

    Wenbo Wu, Qingyi Si, Xiurui Pan, Ye Wang, and Jie Zhang. 2025. LouisKV: Efficient KV Cache Retrieval for Long Input-Output Se- quences. arXiv:2510.11292 [cs.LG]https://arxiv.org/abs/2510.11292

  49. [49]

    Jiaming Xu, Jiayi Pan, Hanzhen Wang, Yongkang Zhou, Jiancai Ye, Yu Wang, and Guohao Dai. 2026. SpeContext: Enabling Efficient Long- context Reasoning with Speculative Context Sparsity in LLMs. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(USA)(ASPLOS ’26). Associa...

  50. [50]

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. 2025. Dream 7B: Diffusion Large Language Models.arXiv preprint arXiv:2508.15487(2025).https: //arXiv.org/abs/2508.15487

  51. [51]

    Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. 2025. FlashInfer: Efficient and Cus- tomizable Attention Engine for LLM Inference Serving. InEighth Conference on Machine Learning and Systems.https://openreview.net/ 13 Omin Kwon, Doyeon Kim, Jongseok ...

  52. [52]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Asso- ciation, Carlsbad, CA, 521–538.https://www.usenix.org/conference/ osdi22/presentation/yu

  53. [53]

    Gonzalez, Clark Barrett, and Ying Sheng

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: effi- cient execution of structured language model programs. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC,...

  54. [54]

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: disaggregating prefill and decoding for goodput-optimized large language model serv- ing. InProceedings of the 18th USENIX Conference on Operating Systems Design and Implementation(Santa Clara, CA, USA)(OSDI’24). USENIX Association, USA, Art...