ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

· 2025 · cs.PF · arXiv 2508.16703

2 Pith papers cite this work. Polarity classification is still indexing.



abstract

Running Large Language Models (LLMs) on-device is now a critical enabler for preserving user privacy. We observe that, in state-of-the-art frameworks, the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of its quantization sensitivity. This fallback degrades the user experience and complicates system scheduling. To this end, this paper presents shadowAttn, a system-algorithm co-designed sparse attention module that minimizes reliance on the CPU/GPU by computing attention sparsely, on only a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens with an NPU-based pilot compute. Further, shadowAttn proposes techniques such as NPU compute graph bucketing, a head-wise NPU-CPU/GPU pipeline, and a per-head fine-grained sparsity ratio to achieve high accuracy and efficiency. shadowAttn delivers the best performance with highly limited CPU/GPU resources; it requires far fewer CPU/GPU resources to deliver performance on par with SoTA frameworks.
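
The idea in the abstract (a cheap low-precision pass estimates which tokens matter, and full-precision attention then runs only on those tokens) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the int8-style symmetric quantization, the top-k selection rule, and the `keep_ratio` parameter are all assumptions made for illustration, and everything runs on the CPU in plain Python rather than on an NPU.

```python
import math

def quantize(vec, levels=127):
    """Symmetric per-vector int8-style quantization (illustrative assumption)."""
    scale = max(abs(x) for x in vec) / levels or 1.0  # fall back to 1.0 for an all-zero vector
    return [round(x / scale) for x in vec], scale

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pilot_scores(q, keys):
    """Low-precision estimate of q.k for every token (stand-in for the pilot pass)."""
    q_int, q_scale = quantize(q)
    scores = []
    for k in keys:
        k_int, k_scale = quantize(k)
        scores.append(dot(q_int, k_int) * q_scale * k_scale)
    return scores

def sparse_attention(q, keys, values, keep_ratio):
    """Full-precision softmax attention over only the top-ranked tokens.

    keep_ratio is a hypothetical knob; a per-head sparsity ratio would
    correspond to calling this with a different value for each head.
    """
    n_keep = max(1, math.ceil(keep_ratio * len(keys)))
    est = pilot_scores(q, keys)
    top = sorted(range(len(keys)), key=lambda i: est[i], reverse=True)[:n_keep]

    inv_sqrt_d = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * inv_sqrt_d for i in top]
    m = max(logits)                      # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)

    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, top)) / total
           for d in range(dim)]
    return out, sorted(top)

if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]]
    values = [[1.0], [2.0], [3.0], [4.0]]
    out, kept = sparse_attention(q, keys, values, keep_ratio=0.5)
    print(kept, out)
```

With `keep_ratio=0.5` only the two keys most aligned with the query are attended to, so the output is a mix of their values alone. Setting `keep_ratio=1.0` keeps every token and reduces to ordinary dense attention, which gives a simple way to check the accuracy cost of a given ratio.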

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.

EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices

cs.OS · 2026-04-10 · unverdicted · novelty 6.0

EdgeFlow reduces mobile LLM cold-start latency up to 4.07x versus llama.cpp, MNN, and llm.npu by NPU-aware adaptive quantization, SIMD-friendly packing, and synergistic granular CPU-NPU pipelining at comparable accuracy.

citing papers explorer

Showing 2 of 2 citing papers.

Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs cs.LG · 2026-04-20 · unverdicted · none · ref 54 · internal anchor
EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices cs.OS · 2026-04-10 · unverdicted · none · ref 69 · internal anchor
