ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference while maintaining accuracy.
Snapdragon 8 gen 3 mobile platform product brief
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.PF 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference while maintaining accuracy.