NPU Design for Diffusion Language Model Inference
Pith reviewed 2026-05-16 09:57 UTC · model grok-4.3
The pith
A custom NPU with a new ISA and Block-Adaptive Online Smoothing is the first accelerator built specifically for diffusion language model inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the first NPU accelerator specifically designed for dLLMs. It delivers a dLLM-oriented ISA and compiler, a hardware-optimized execution model for both the transformer inference and diffusion sampling used in dLLMs, a novel Block-Adaptive Online Smoothing (BAOS) for quantizing KV cache in dLLMs, and a complete RTL implementation synthesized in 7nm. Evaluation relies on a tri-path simulation framework comprising analytical, cycle-accurate, and accuracy simulators together with cross-validations against physical hardware.
What carries the argument
The dLLM-oriented ISA together with the Block-Adaptive Online Smoothing (BAOS) scheme that performs online per-block smoothing to handle step-wise KV cache distribution shifts during blocked diffusion inference.
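The paper does not spell out the BAOS algorithm here, but the description — online per-block smoothing that tracks step-wise KV-cache distribution shifts — can be illustrated with a minimal sketch. Everything below is an assumption: the SmoothQuant-style per-channel scaling, the `alpha` exponent, and the `momentum` blending across diffusion steps are hypothetical stand-ins for whatever BAOS actually does.

```python
import numpy as np

def baos_quantize_block(kv_block, alpha=0.5, prev_scale=None, momentum=0.9):
    """Hypothetical sketch of Block-Adaptive Online Smoothing (BAOS).

    kv_block: (tokens, channels) KV-cache slice for one diffusion block.
    Per-channel smoothing factors are recomputed online at every block
    refresh, optionally blended with the previous step's factors to track
    step-wise distribution drift, then the smoothed block is quantized
    to int8.
    """
    # Per-channel absolute maxima for this block at this step.
    ch_max = np.abs(kv_block).max(axis=0) + 1e-8
    # SmoothQuant-style smoothing factor, adapted per block and per step.
    scale = ch_max ** alpha
    if prev_scale is not None:            # follow the drifting distribution
        scale = momentum * prev_scale + (1 - momentum) * scale
    smoothed = kv_block / scale           # flatten channel outliers
    q_step = np.abs(smoothed).max() / 127.0 + 1e-8
    q = np.clip(np.round(smoothed / q_step), -127, 127).astype(np.int8)
    return q, scale, q_step

def baos_dequantize(q, scale, q_step):
    """Invert the quantization: int8 -> smoothed float -> original scale."""
    return q.astype(np.float32) * q_step * scale
```

The key contrast with static AR-derived schemes is that `scale` is recomputed each refresh step rather than calibrated once offline.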
If this is right
- dLLM inference can now target a memory hierarchy and sampling units matched to bidirectional attention and iterative block refresh.
- KV cache quantization becomes viable for dLLMs by tracking per-block activation shifts instead of assuming static distributions.
- The open-sourced ISA, compiler, and simulation stack allow other groups to extend the accelerator for larger dLLM variants.
- Non-GEMM top-k sampling stages receive dedicated hardware support rather than falling back to general-purpose cores.
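The contrast the second bullet draws — blocked, in-place KV refresh versus the append-only caches AR NPUs assume — can be sketched as two toy cache classes. Both classes and their interfaces are hypothetical illustrations, not the paper's design:

```python
import numpy as np

class ARKVCache:
    """Append-only cache assumed by autoregressive NPUs: each entry is
    written once per generated token and never rewritten."""
    def __init__(self, dim):
        self.dim = dim
        self.entries = []

    def append(self, kv):
        self.entries.append(kv)

class BlockedKVCache:
    """Hypothetical blocked cache for dLLMs: an entire block of entries
    is rewritten in place at every diffusion step, and only becomes
    read-only (reusable across steps) once the block is finalized."""
    def __init__(self, num_blocks, block_len, dim):
        self.blocks = np.zeros((num_blocks, block_len, dim), dtype=np.float32)
        self.finalized = np.zeros(num_blocks, dtype=bool)

    def refresh(self, block_idx, new_kv):
        assert not self.finalized[block_idx], "finalized blocks are read-only"
        self.blocks[block_idx] = new_kv     # in-place rewrite each step

    def finalize(self, block_idx):
        self.finalized[block_idx] = True    # eligible for cross-step reuse
```

The in-place `refresh` is what breaks the write-once memory-layout assumptions baked into AR accelerators' cache management.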
Where Pith is reading between the lines
- Future NPU designs may incorporate configurable attention direction and refresh modes as first-class ISA features once dLLMs gain traction.
- The BAOS technique could be tested on other iterative generative models that also exhibit block-local distribution drift.
- If dLLMs scale to longer contexts, the blocked KV organization may reduce overall memory bandwidth compared with append-only AR caches.
Load-bearing premise
That diffusion language models will become common enough that their blocked KV refreshing and top-k sampling patterns cannot be supported efficiently by existing autoregressive NPUs without a custom ISA and quantization method.
What would settle it
Measure end-to-end inference latency and energy per token of a dLLM workload on the proposed 7 nm NPU versus the same workload running on a contemporary AR NPU; a large gap in favor of the new design would support the claim.
Original abstract
Diffusion-based LLMs (dLLMs) fundamentally depart from traditional autoregressive (AR) LLM inference: they leverage bidirectional attention, block-wise KV cache refreshing, cross-step reuse, and a non-GEMM-centric sampling phase. These characteristics make current dLLMs incompatible with most existing NPUs, as their inference patterns, in particular the reduction-heavy, top-$k$-driven sampling stage, demand new ISA and memory hierarchy support beyond that of AR accelerators. In addition, the blocked diffusion KV cache breaks from the append-only paradigm assumed by AR NPUs, and conventional AR-derived KV quantization schemes were designed for static activation distributions and do not account for the step-wise distribution shifts introduced by iterative block-wise refinement in dLLMs. In this paper, we introduce the first NPU accelerator specifically designed for dLLMs. It delivers: a dLLM-oriented ISA and compiler; a hardware-optimized execution model for both the transformer inference and diffusion sampling used in dLLMs; a novel Block-Adaptive Online Smoothing (BAOS) for quantizing KV cache in dLLMs; and a complete RTL implementation synthesized in 7nm. To evaluate and validate our design, we introduce a tri-path simulation framework that comprises analytical, cycle-accurate, and accuracy simulators, together with cross-validations against physical hardware. The full NPU stack, including ISA, simulation tools, and quantization software, will be open-sourced upon acceptance.
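The "reduction-heavy, top-$k$-driven sampling stage" the abstract singles out is, at its core, an arg-top-k reduction over per-position confidences at each diffusion step. A minimal sketch of that step follows; the function name, the confidence/mask representation, and the use of `argpartition` are illustrative assumptions, not the paper's hardware algorithm:

```python
import numpy as np

def topk_unmask(confidence, masked, k):
    """Sketch of one top-k-driven sampling step: among still-masked
    positions, pick the k highest-confidence ones and unmask them.
    This vector reduction has no GEMM structure, which is why AR
    accelerators lack dedicated support for it."""
    scores = np.where(masked, confidence, -np.inf)  # revealed tokens opt out
    k = min(k, int(masked.sum()))
    # argpartition performs the top-k selection in O(n)
    chosen = np.argpartition(-scores, k - 1)[:k] if k > 0 else np.array([], int)
    new_mask = masked.copy()
    new_mask[chosen] = False
    return new_mask, chosen
```

Iterating this step until `new_mask` is all-False is the block-wise refinement loop the abstract contrasts with token-by-token AR decoding.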
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the design of the first NPU accelerator tailored for diffusion-based language models (dLLMs). Key contributions include a dLLM-specific ISA and compiler, a hardware execution model handling bidirectional attention and diffusion sampling, the Block-Adaptive Online Smoothing (BAOS) method for dynamic KV cache quantization, and a full RTL design synthesized in 7nm technology. Evaluation relies on a tri-path simulator (analytical, cycle-accurate, accuracy) with hardware cross-validation, and the authors commit to open-sourcing the entire stack.
Significance. Should the quantitative results support the claims, this work would be significant in addressing the mismatch between dLLM inference patterns and existing AR-oriented NPUs. The open-sourcing of ISA, compiler, and tools is a notable strength that could facilitate further research in this emerging area.
Major comments (2)
- [Abstract] The central claim of a 'complete RTL implementation synthesized in 7nm' is load-bearing for the practicality of the design, yet the abstract supplies no quantitative metrics on area, power, frequency, or speedup versus AR NPUs; without these, the assertion that existing accelerators are insufficient cannot be evaluated.
- [Evaluation] The tri-path simulation framework is presented as the validation vehicle, but no details appear on cross-validation error bounds, specific benchmarks, or how accuracy simulator results map to the cycle-accurate model; this directly affects assessment of BAOS and the execution model.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.
Point-by-point responses
- Referee: [Abstract] The central claim of a 'complete RTL implementation synthesized in 7nm' is load-bearing for the practicality of the design, yet the abstract supplies no quantitative metrics on area, power, frequency, or speedup versus AR NPUs; without these, the assertion that existing accelerators are insufficient cannot be evaluated.
  Authors: We agree that the abstract should include key quantitative metrics to support the central claims. The full manuscript reports 7nm synthesis results with specific figures for area, power, frequency, and speedups versus AR NPUs, but these were omitted from the abstract for brevity. We will revise the abstract to incorporate these metrics (e.g., achieved frequency, area, power, and comparative performance gains) so that the practicality assertion can be directly evaluated. Revision: yes.
- Referee: [Evaluation] The tri-path simulation framework is presented as the validation vehicle, but no details appear on cross-validation error bounds, specific benchmarks, or how accuracy simulator results map to the cycle-accurate model; this directly affects assessment of BAOS and the execution model.
  Authors: We acknowledge that additional transparency on the tri-path framework is warranted. The manuscript describes the analytical, cycle-accurate, and accuracy simulators along with hardware cross-validation, but we will expand the evaluation section to specify the exact benchmarks, report cross-validation error bounds, and detail the mapping methodology between the accuracy simulator outputs and the cycle-accurate model. This will enable better assessment of BAOS and the execution model. Revision: yes.
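The "cross-validation error bounds" the referee asks for amount to per-benchmark relative error between simulator-predicted and hardware-measured cycle counts. A hypothetical sketch of that reporting (the function and metric names are illustrative, not from the manuscript):

```python
import numpy as np

def cross_validation_report(sim_cycles, hw_cycles):
    """Per-benchmark relative error between simulator-predicted and
    hardware-measured cycle counts, plus the worst-case bound a reviewer
    could check against the simulator's claimed fidelity."""
    sim = np.asarray(sim_cycles, dtype=np.float64)
    hw = np.asarray(hw_cycles, dtype=np.float64)
    rel_err = np.abs(sim - hw) / hw
    return {"mean_rel_err": float(rel_err.mean()),
            "max_rel_err": float(rel_err.max())}
```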
Circularity Check
No significant circularity in hardware design proposal
Full rationale
The paper is a hardware architecture and systems proposal for dLLM inference, introducing an ISA, compiler, execution model, BAOS KV quantization, and 7nm RTL. No mathematical derivation chain, equations, or predictions are present that reduce to fitted parameters, self-definitions, or self-citation load-bearing steps. The design is presented as self-contained with explicit tri-path simulation cross-validated to physical hardware and plans for open-sourcing, satisfying the criteria for an independent contribution without circular reductions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: dLLMs leverage bidirectional attention, block-wise KV cache refreshing, cross-step reuse, and a non-GEMM-centric sampling phase.
Invented entities (1)
- Block-Adaptive Online Smoothing (BAOS): no independent evidence.
Reference graph
Works this paper leans on
- [1] R. Yu, Q. Li, and X. Wang, "Discrete diffusion in large language and multimodal models: A survey," arXiv preprint arXiv:2506.13759, 2025.
- [2] Y. Ma, L. Du, L. Wei, K. Chen, Q. Xu, K. Wang, G. Feng, G. Lu, L. Liu, X. Qi et al., "dInfer: An efficient inference framework for diffusion language models," arXiv preprint arXiv:2510.08666, 2025.
- [3] C. Wu, H. Zhang, S. Xue, Z. Liu, S. Diao, L. Zhu, P. Luo, S. Han, and E. Xie, "Fast-dLLM: Training-free acceleration of diffusion LLM by enabling KV cache and parallel decoding," arXiv preprint arXiv:2505.22618, 2025.
- [4] S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J.-R. Wen, and C. Li, "Large language diffusion models," arXiv preprint arXiv:2502.09992, 2025.
- [5] J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong, "Dream 7B: Diffusion large language models," arXiv preprint arXiv:2508.15487, 2025.
- [6] H. Wu, C. Xiao, J. Nie, X. Guo, B. Lou, J. T. Wong, Z. Mo, C. Zhang, P. Forys, W. Luk et al., "Combating the memory walls: Optimization pathways for long-context agentic LLM inference," arXiv preprint arXiv:2509.09505, 2025.
- [7] S. Ghodrati, S. Kinzer, H. Xu, R. Mahapatra, Y. Kim, B. H. Ahn, D. K. Wang, L. Karthikeyan, A. Yazdanbakhsh, J. Park et al., "Tandem processor: Grappling with emerging operators in neural networks," in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 1165–1182.
- [8] D. Xu, H. Zhang, L. Yang, R. Liu, G. Huang, M. Xu, and X. Liu, "Fast on-device LLM inference with NPUs," in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2025, pp. 445–462.
- [9] D. Xu, M. Xu, Q. Wang, S. Wang, Y. Ma, K. Huang, G. Huang, X. Jin, and X. Liu, "Mandheling: Mixed-precision on-device DNN training with DSP offloading," in Proceedings of the 28th Annual International Conference on Mobile Computing and Networking, 2022, pp. 214–227.
- [10] Z. Hao, J. Wei, T. Wang, M. Huang, H. Jiang, S. Jiang, T. Cao, and J. Ren, "Scaling LLM test-time compute with mobile NPU on smartphones," arXiv preprint arXiv:2509.23324, 2025.
- [11] T. Dao, "FlashAttention-2: Faster attention with better parallelism and work partitioning," arXiv preprint arXiv:2307.08691, 2023.
- [12] H. Luo, Y. C. Tuğrul, F. N. Bostancı, A. Olgun, A. G. Yağlıkçı, and O. Mutlu, "Ramulator 2.0: A modern, modular, and extensible DRAM simulator," IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 112–116, 2023.
- [13] M. Arriola, A. Gokaslan, J. T. Chiu, Z. Yang, Z. Qi, J. Han, S. S. Sahoo, and V. Kuleshov, "Block diffusion: Interpolating between autoregressive and diffusion language models," arXiv preprint arXiv:2503.09573, 2025.
- [14] L. T. Clark, V. Vashishtha, L. Shifren, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy, and G. Yeric, "ASAP7: A 7-nm FinFET predictive process design kit," Microelectronics Journal, vol. 53, pp. 105–115, 2016.