Introduces the first NPU accelerator for diffusion language models with dLLM-specific ISA, hardware execution model, BAOS KV quantization, and 7nm RTL synthesis.
Fast on-device llm inference with npus
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting for runtime conditions.
citing papers explorer
-
NPU Design for Diffusion Language Model Inference
Introduces the first NPU accelerator for diffusion language models with dLLM-specific ISA, hardware execution model, BAOS KV quantization, and 7nm RTL synthesis.
-
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting for runtime conditions.