pith. machine review for the scientific record.

arxiv: 2601.20706 · v2 · submitted 2026-01-28 · 💻 cs.AR · cs.AI · cs.DC

Recognition: no theorem link

NPU Design for Diffusion Language Model Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:57 UTC · model grok-4.3

classification 💻 cs.AR · cs.AI · cs.DC
keywords NPU accelerator · diffusion language models · KV cache quantization · custom ISA · inference hardware · block-wise KV cache · transformer sampling

The pith

A custom NPU with a new ISA and Block-Adaptive Online Smoothing is the first accelerator built for diffusion language model inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that diffusion-based LLMs require dedicated hardware because their bidirectional attention, block-wise KV cache updates, and non-GEMM sampling steps do not map efficiently onto existing autoregressive NPUs. It delivers a dLLM-specific instruction set, an execution model covering both transformer passes and diffusion sampling, and a quantization method called BAOS that adapts to per-block distribution shifts during iterative refinement. A full RTL implementation in 7 nm plus a tri-path simulator are provided to show the design works. If correct, this removes a hardware barrier that would otherwise force dLLM workloads onto mismatched accelerators or software-only runtimes.

Core claim

We introduce the first NPU accelerator specifically designed for dLLMs. It delivers a dLLM-oriented ISA and compiler, a hardware-optimized execution model for both the transformer inference and diffusion sampling used in dLLMs, a novel Block-Adaptive Online Smoothing (BAOS) for quantizing KV cache in dLLMs, and a complete RTL implementation synthesized in 7nm. Evaluation relies on a tri-path simulation framework comprising analytical, cycle-accurate, and accuracy simulators together with cross-validations against physical hardware.

What carries the argument

The dLLM-oriented ISA together with the Block-Adaptive Online Smoothing (BAOS) scheme that performs online per-block smoothing to handle step-wise KV cache distribution shifts during blocked diffusion inference.
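
The paper's own BAOS algorithm is not reproduced on this page, so the following is a minimal sketch of the general idea as summarized above: per-block, online-updated smoothing factors that track step-wise KV distribution shifts before low-bit quantization. The EMA statistics, the alpha exponent, and the symmetric INT8 target are illustrative assumptions, not details taken from the paper.

    import numpy as np

    class BlockOnlineSmoother:
        """Per-KV-block smoothing + symmetric INT8 quantization (illustrative)."""

        def __init__(self, n_channels, alpha=0.5, momentum=0.9):
            self.alpha = alpha                      # smoothing strength (assumed)
            self.momentum = momentum                # EMA decay for online stats (assumed)
            self.ch_absmax = np.ones(n_channels)    # running per-channel |max|

        def update_stats(self, block_kv):
            # block_kv: (tokens_in_block, n_channels), refreshed every diffusion
            # step, so statistics must be re-estimated online rather than fixed
            # at calibration time.
            cur = np.abs(block_kv).max(axis=0) + 1e-6
            self.ch_absmax = self.momentum * self.ch_absmax + (1 - self.momentum) * cur

        def quantize(self, block_kv):
            # Divide outlier-heavy channels down before quantizing; the folded-out
            # scale would be absorbed into the query path at matmul time.
            scale = self.ch_absmax ** self.alpha
            smoothed = block_kv / scale
            q_step = np.abs(smoothed).max() / 127.0 + 1e-12
            q = np.clip(np.round(smoothed / q_step), -127, 127).astype(np.int8)
            return q, q_step, scale                 # dequantize as q * q_step * scale

One smoother per cached block, with update_stats followed by quantize at every block refresh, is what lets the scales follow the drift instead of assuming the static activation distributions that AR-derived schemes rely on.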

If this is right

  • dLLM inference can now target a memory hierarchy and sampling units matched to bidirectional attention and iterative block refresh.
  • KV cache quantization becomes viable for dLLMs by tracking per-block activation shifts instead of assuming static distributions.
  • The open-sourced ISA, compiler, and simulation stack allow other groups to extend the accelerator for larger dLLM variants.
  • Non-GEMM top-k sampling stages receive dedicated hardware support rather than falling back to general-purpose cores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future NPU designs may incorporate configurable attention direction and refresh modes as first-class ISA features once dLLMs gain traction.
  • The BAOS technique could be tested on other iterative generative models that also exhibit block-local distribution drift.
  • If dLLMs scale to longer contexts, the blocked KV organization may reduce overall memory bandwidth compared with append-only AR caches.

Load-bearing premise

That diffusion language models will become common enough that their blocked KV refreshing and top-k sampling patterns cannot be supported efficiently by existing autoregressive NPUs without a custom ISA and quantization method.

What would settle it

Measure end-to-end inference latency and energy per token of a dLLM workload on the proposed 7 nm NPU versus the same workload running on a contemporary AR NPU; a large gap in favor of the new design would support the claim.
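
As a hedged sketch of what that experiment reduces to, the harness below turns measured device numbers into the two quantities the claim hinges on: latency per diffusion step and energy per token. The device figures and workload shape here are placeholders invented for illustration, not numbers from the paper.

    from dataclasses import dataclass

    @dataclass
    class Device:
        name: str
        tops: float     # peak INT8 throughput, tera-ops/s (placeholder)
        bw_gbs: float   # DRAM bandwidth, GB/s (placeholder)
        watts: float    # average board power, W (placeholder)

    def step_latency_s(dev, ops, bytes_moved):
        # First-order roofline: a step is bound by the slower of compute and memory.
        return max(ops / (dev.tops * 1e12), bytes_moved / (dev.bw_gbs * 1e9))

    # Illustrative dLLM block step: ~7B weights at INT8, 32 tokens refined per step.
    ops = 2 * 7e9 * 32          # ~2 * params * tokens MACs for one forward pass
    bytes_moved = 7e9           # weight traffic dominates at small block sizes
    tokens_per_step = 32

    for dev in (Device("AR-NPU", 200, 800, 60), Device("dLLM-NPU", 200, 800, 60)):
        t = step_latency_s(dev, ops, bytes_moved)
        e = dev.watts * t / tokens_per_step
        print(f"{dev.name}: {t*1e3:.2f} ms/step, {e*1e3:.2f} mJ/token")

With identical placeholder inputs the two rows tie by construction; the settling measurement is precisely the substitution of real per-device numbers, including the non-GEMM sampling time an AR NPU would push onto a host core.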

Figures

Figures reproduced from arXiv: 2601.20706 by Aaron Zhao, Binglei Lou, Can Xiao, Gregor MacDonald, Haoran Wu, Jianyi Cheng, Jiayi Nie, Kevin Lau, Rika Antonova, Robert Mullins, Xuan Guo, Yao Lai.

Figure 1: Latency breakdown of the LLaDA model on an A6000.
Figure 2: Diffusion Large Language Model. For each timestep, …
Figure 3: Proposed NPU architecture for diffusion sampling.
Figure 4: Hardware–software co-design and verification work…
Figure 5: Latency and Memory utilization of diffusion sampling.
original abstract

Diffusion-based LLMs (dLLMs) fundamentally depart from traditional autoregressive (AR) LLM inference: they leverage bidirectional attention, block-wise KV cache refreshing, cross-step reuse, and a non-GEMM-centric sampling phase. These characteristics make current dLLMs incompatible with most existing NPUs, as their inference patterns, in particular the reduction-heavy, top-$k$-driven sampling stage, demand new ISA and memory hierarchy support beyond that of AR accelerators. In addition, the blocked diffusion KV cache breaks from the append-only paradigm assumed by AR NPUs, and conventional AR-derived KV quantization schemes were designed for static activation distributions and do not account for the step-wise distribution shifts introduced by iterative block-wise refinement in dLLMs. In this paper, we introduce the first NPU accelerator specifically designed for dLLMs. It delivers: a dLLM-oriented ISA and compiler; a hardware-optimized execution model for both the transformer inference and diffusion sampling used in dLLMs; a novel Block-Adaptive Online Smoothing (BAOS) for quantizing KV cache in dLLMs; and a complete RTL implementation synthesized in 7nm. To evaluate and validate our design, we introduce a tri-path simulation framework that comprises analytical, cycle-accurate, and accuracy simulators, together with cross-validations against physical hardware. The full NPU stack, including ISA, simulation tools, and quantization software, will be open-sourced upon acceptance.
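
To make the abstract's "reduction-heavy, top-k-driven sampling stage" and block-wise KV refresh concrete, here is a toy rendering of one refinement step. It is not the paper's sampler: the confidence rule (max softmax probability), the greedy fill-in, and the shapes are all illustrative assumptions.

    import numpy as np

    def diffusion_block_step(logits, mask, kv_block, new_kv, k):
        """One toy refinement step over a block.
        logits:  (block, vocab) model outputs for the block
        mask:    (block,) bool, True where the position is still masked
        kv_block, new_kv: (block, d) cached vs freshly computed K (or V) rows
        """
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        conf = np.where(mask, probs.max(axis=-1), -np.inf)   # only masked slots compete
        k = min(k, int(mask.sum()))
        if k == 0:
            return np.array([], dtype=int), np.array([], dtype=int)
        unmask = np.argsort(conf)[-k:]          # the reduction-heavy top-k selection
        tokens = probs[unmask].argmax(axis=-1)  # greedy fill-in for the toy
        mask[unmask] = False
        kv_block[:] = new_kv                    # overwrite-in-place refresh, not append
        return unmask, tokens

The two lines that break AR-NPU assumptions are the argsort-based top-k selection, which contains no GEMM at all, and the in-place overwrite of kv_block, which an append-only cache layout cannot express.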

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents the design of the first NPU accelerator tailored for diffusion-based language models (dLLMs). Key contributions include a dLLM-specific ISA and compiler, a hardware execution model handling bidirectional attention and diffusion sampling, the Block-Adaptive Online Smoothing (BAOS) method for dynamic KV cache quantization, and a full RTL design synthesized in 7nm technology. Evaluation relies on a tri-path simulator (analytical, cycle-accurate, accuracy) with hardware cross-validation, and the authors commit to open-sourcing the entire stack.

Significance. Should the quantitative results support the claims, this work would be significant in addressing the mismatch between dLLM inference patterns and existing AR-oriented NPUs. The open-sourcing of ISA, compiler, and tools is a notable strength that could facilitate further research in this emerging area.

major comments (2)
  1. [Abstract] The central claim of a 'complete RTL implementation synthesized in 7nm' is load-bearing for the practicality of the design, yet the abstract supplies no quantitative metrics on area, power, frequency, or speedup versus AR NPUs; without these, the assertion that existing accelerators are insufficient cannot be evaluated.
  2. [Evaluation] The tri-path simulation framework is presented as the validation vehicle, but no details appear on cross-validation error bounds, specific benchmarks, or how accuracy-simulator results map to the cycle-accurate model; this directly affects assessment of BAOS and the execution model.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

point-by-point responses
  1. Referee: [Abstract] The central claim of a 'complete RTL implementation synthesized in 7nm' is load-bearing for the practicality of the design, yet the abstract supplies no quantitative metrics on area, power, frequency, or speedup versus AR NPUs; without these, the assertion that existing accelerators are insufficient cannot be evaluated.

    Authors: We agree that the abstract should include key quantitative metrics to support the central claims. The full manuscript reports 7nm synthesis results with specific figures for area, power, frequency, and speedups versus AR NPUs, but these were omitted from the abstract for brevity. We will revise the abstract to incorporate these metrics (e.g., achieved frequency, area, power, and comparative performance gains) so that the practicality assertion can be directly evaluated. revision: yes

  2. Referee: [Evaluation] The tri-path simulation framework is presented as the validation vehicle, but no details appear on cross-validation error bounds, specific benchmarks, or how accuracy-simulator results map to the cycle-accurate model; this directly affects assessment of BAOS and the execution model.

    Authors: We acknowledge that additional transparency on the tri-path framework is warranted. The manuscript describes the analytical, cycle-accurate, and accuracy simulators along with hardware cross-validation, but we will expand the evaluation section to specify the exact benchmarks, report cross-validation error bounds, and detail the mapping methodology between the accuracy simulator outputs and the cycle-accurate model. This will enable better assessment of BAOS and the execution model. revision: yes
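
A minimal sketch of the cross-validation being promised, under the assumption that it reduces to comparing per-benchmark cycle counts between simulators (or simulator and silicon) and reporting error bounds. The benchmark names and cycle figures below are placeholders, not results from the paper.

    def cross_validate(predicted, measured):
        """predicted, measured: dict mapping benchmark name -> cycle count.
        Returns per-benchmark relative error plus mean and worst-case bounds."""
        errs = {b: abs(predicted[b] - measured[b]) / measured[b] for b in measured}
        return errs, sum(errs.values()) / len(errs), max(errs.values())

    # Placeholder numbers purely for illustration.
    pred = {"attn_block": 1.02e6, "topk_sampling": 2.1e5, "kv_refresh": 3.4e5}
    meas = {"attn_block": 1.00e6, "topk_sampling": 2.3e5, "kv_refresh": 3.3e5}
    errs, mean_err, worst_err = cross_validate(pred, meas)
    print(errs)
    print(f"mean error {mean_err:.1%}, worst-case {worst_err:.1%}")

Reporting both the mean and the worst-case bound matters here: a small average error can hide a large miss on exactly the kernel (e.g. the sampling stage) whose cost the design argument depends on.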

Circularity Check

0 steps flagged

No significant circularity in hardware design proposal

full rationale

The paper is a hardware architecture and systems proposal for dLLM inference, introducing an ISA, compiler, execution model, BAOS KV quantization, and 7nm RTL. No mathematical derivation chain, equations, or predictions are present that reduce to fitted parameters, self-definitions, or self-citation load-bearing steps. The design is presented as self-contained with explicit tri-path simulation cross-validated to physical hardware and plans for open-sourcing, satisfying the criteria for an independent contribution without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The design rests on the domain assumption that dLLMs exhibit fundamentally different inference patterns from AR LLMs that existing NPUs cannot efficiently support.

axioms (1)
  • domain assumption: dLLMs leverage bidirectional attention, block-wise KV cache refreshing, cross-step reuse, and a non-GEMM-centric sampling phase.
    Stated directly in the abstract as the reason current NPUs are incompatible.
invented entities (1)
  • Block-Adaptive Online Smoothing (BAOS): no independent evidence
    purpose: quantizing the KV cache while accounting for step-wise distribution shifts in dLLMs.
    A new quantization method introduced to handle iterative block-wise refinement not addressed by conventional AR-derived schemes.

pith-pipeline@v0.9.0 · 5593 in / 1518 out tokens · 42527 ms · 2026-05-16T09:57:08.953040+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 6 internal anchors

  1. [1]

    Discrete diffusion in large language and multimodal models: A survey

    R. Yu, Q. Li, and X. Wang, "Discrete diffusion in large language and multimodal models: A survey," arXiv preprint arXiv:2506.13759, 2025.

  2. [2]

    dinfer: An efficient inference framework for diffusion language models

    Y. Ma, L. Du, L. Wei, K. Chen, Q. Xu, K. Wang, G. Feng, G. Lu, L. Liu, X. Qi et al., "dinfer: An efficient inference framework for diffusion language models," arXiv preprint arXiv:2510.08666, 2025.

  3. [3]

    Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

    C. Wu, H. Zhang, S. Xue, Z. Liu, S. Diao, L. Zhu, P. Luo, S. Han, and E. Xie, "Fast-dLLM: Training-free acceleration of diffusion LLM by enabling KV cache and parallel decoding," arXiv preprint arXiv:2505.22618, 2025.

  4. [4]

    Large Language Diffusion Models

    S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J.-R. Wen, and C. Li, "Large language diffusion models," arXiv preprint arXiv:2502.09992, 2025.

  5. [5]

    Dream 7B: Diffusion Large Language Models

    J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong, "Dream 7B: Diffusion large language models," arXiv preprint arXiv:2508.15487, 2025.

  6. [6]

    Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

    H. Wu, C. Xiao, J. Nie, X. Guo, B. Lou, J. T. Wong, Z. Mo, C. Zhang, P. Forys, W. Luk et al., "Combating the memory walls: Optimization pathways for long-context agentic LLM inference," arXiv preprint arXiv:2509.09505, 2025.

  7. [7]

    Tandem Processor: Grappling with Emerging Operators in Neural Networks

    S. Ghodrati, S. Kinzer, H. Xu, R. Mahapatra, Y. Kim, B. H. Ahn, D. K. Wang, L. Karthikeyan, A. Yazdanbakhsh, J. Park et al., "Tandem processor: Grappling with emerging operators in neural networks," in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 1165–1182.

  8. [8]

    Fast On-Device LLM Inference with NPUs

    D. Xu, H. Zhang, L. Yang, R. Liu, G. Huang, M. Xu, and X. Liu, "Fast on-device LLM inference with NPUs," in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2025, pp. 445–462.

  9. [9]

    Mandheling: Mixed-Precision On-Device DNN Training with DSP Offloading

    D. Xu, M. Xu, Q. Wang, S. Wang, Y. Ma, K. Huang, G. Huang, X. Jin, and X. Liu, "Mandheling: Mixed-precision on-device DNN training with DSP offloading," in Proceedings of the 28th Annual International Conference on Mobile Computing and Networking, 2022, pp. 214–227.

  10. [10]

    Scaling LLM Test-Time Compute with Mobile NPU on Smartphones

    Z. Hao, J. Wei, T. Wang, M. Huang, H. Jiang, S. Jiang, T. Cao, and J. Ren, "Scaling LLM test-time compute with mobile NPU on smartphones," arXiv preprint arXiv:2509.23324, 2025.

  11. [11]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    T. Dao, "FlashAttention-2: Faster attention with better parallelism and work partitioning," arXiv preprint arXiv:2307.08691, 2023.

  12. [12]

    Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator

    H. Luo, Y. C. Tuğrul, F. N. Bostancı, A. Olgun, A. G. Yağlıkçı, and O. Mutlu, "Ramulator 2.0: A modern, modular, and extensible DRAM simulator," IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 112–116, 2023.

  13. [13]

    Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

    M. Arriola, A. Gokaslan, J. T. Chiu, Z. Yang, Z. Qi, J. Han, S. S. Sahoo, and V. Kuleshov, "Block diffusion: Interpolating between autoregressive and diffusion language models," arXiv preprint arXiv:2503.09573, 2025.

  14. [14]

    ASAP7: A 7-nm FinFET Predictive Process Design Kit

    L. T. Clark, V. Vashishtha, L. Shifren, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy, and G. Yeric, "ASAP7: A 7-nm FinFET predictive process design kit," Microelectronics Journal, vol. 53, pp. 105–115, 2016.