
arXiv: 2508.16703 · v4 · submitted 2025-08-22 · cs.PF · cs.AI · cs.LG

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

Pith reviewed 2026-05-18 21:59 UTC · model grok-4.3

classification cs.PF · cs.AI · cs.LG
keywords shadowAttn · sparse attention · on-device LLM · NPU inference · quantization sensitivity · system-algorithm co-design · attention fallback

The pith

shadowAttn reduces the reliance of on-device LLM attention on CPU/GPU by using an NPU pilot compute to find the important tokens and computing attention only on those.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that the attention operator in LLMs can remain on specialized NPU hardware instead of falling back to CPU or GPU due to quantization sensitivity. It does so through shadowAttn, a co-designed sparse attention module that estimates key tokens with a small NPU pilot computation and then computes attention only on those tokens. Additional techniques such as NPU compute graph bucketing, head-wise pipelining between NPU and CPU/GPU, and per-head sparsity ratios help maintain accuracy while improving efficiency. If this holds, on-device LLM inference becomes faster and simpler to schedule with far less dependence on general-purpose processors. A reader would care because current frameworks suffer degraded performance and added system complexity from the fallback, limiting practical privacy-preserving AI on phones and edge devices.

Core claim

shadowAttn is a system-algorithm co-designed sparse attention module with minimal reliance on CPU/GPU: it computes attention on only a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens with an NPU-based pilot compute. Further, shadowAttn proposes NPU compute graph bucketing, a head-wise NPU-CPU/GPU pipeline, and a per-head fine-grained sparsity ratio to achieve high accuracy and efficiency. shadowAttn delivers the best performance with highly limited CPU/GPU resources; it requires far fewer CPU/GPU resources to deliver performance on par with SoTA frameworks.

What carries the argument

shadowAttn, a sparse attention module that hides token-importance estimation overhead via NPU pilot compute and applies graph bucketing, head-wise pipelining, and per-head sparsity to limit CPU/GPU use.
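A minimal sketch of the two-stage idea described above, assuming a symmetric int8 quantizer stands in for the NPU pilot compute and plain floating-point arithmetic stands in for the CPU/GPU stage. The function names (`quantize`, `pilot_topk`, `sparse_attention`) and the fixed `scale` are illustrative assumptions, not the paper's API.

```python
import math

def quantize(vec, scale):
    """Symmetric int8 quantization: round(x / scale), clamped to [-127, 127]."""
    return [max(-127, min(127, round(x / scale))) for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def pilot_topk(q, keys, k, scale):
    """Low-precision estimate (assumed stand-in for the NPU pilot):
    rank keys by their int8 dot product with q and keep the top-k indices."""
    q_int = quantize(q, scale)
    approx = [dot(q_int, quantize(key, scale)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: approx[i], reverse=True)
    return sorted(ranked[:k])

def sparse_attention(q, keys, values, idx):
    """Full-precision attention restricted to the selected token indices."""
    d = len(q)
    weights = softmax([dot(q, keys[i]) / math.sqrt(d) for i in idx])
    dim = len(values[0])
    return [sum(w * values[i][j] for w, i in zip(weights, idx)) for j in range(dim)]

# Hypothetical toy data: one query, four cached tokens.
q = [1.0, 0.0]
keys = [[4.0, 0.0], [0.1, 0.2], [-1.0, 0.5], [3.5, 0.1]]
values = [[1.0], [2.0], [3.0], [4.0]]
idx = pilot_topk(q, keys, k=2, scale=0.05)
print(idx)  # [0, 3]
print(sparse_attention(q, keys, values, idx))
```

Only the ranking from the low-precision pass is used; the attention weights themselves are recomputed at full precision on the selected tokens, which is why a coarse estimate can be enough in this sketch.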

If this is right

  • shadowAttn delivers the best performance when CPU/GPU resources are highly limited.
  • It requires far fewer CPU/GPU resources to deliver performance on par with SoTA frameworks.
  • NPU compute graph bucketing and head-wise pipeline support both accuracy and efficiency.
  • Per-head fine-grained sparsity ratio allows tailored trade-offs across attention heads.
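This page does not spell out how compute graph bucketing works. One plausible reading, sketched here purely as an assumption, is that a static NPU graph needs its quantization scale fixed ahead of time, so a small set of graphs is prepared with log-spaced scale factors and each head is routed at runtime to the nearest bucket that does not clip. `make_buckets` and `pick_bucket` are hypothetical helper names.

```python
import bisect
import math

def make_buckets(lo, hi, n):
    """Return n log-spaced scale factors from lo to hi (n >= 2)."""
    step = (math.log(hi) - math.log(lo)) / (n - 1)
    return [math.exp(math.log(lo) + i * step) for i in range(n)]

def pick_bucket(buckets, scale):
    """Index of the smallest bucket >= scale; falls back to the largest bucket."""
    i = bisect.bisect_left(buckets, scale)
    return min(i, len(buckets) - 1)

# Hypothetical range: scale factors between 1e-3 and 1, four prepared graphs.
buckets = make_buckets(0.001, 1.0, 4)
print(pick_bucket(buckets, 0.05))  # 2
```

More buckets track the observed scale more closely at the cost of more prepared graphs; fewer buckets save memory but quantize more coarsely.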

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pilot-compute approach for token selection could apply to other quantization-sensitive operators beyond attention.
  • Devices with very constrained CPU or GPU cores might now support larger models without major accuracy trade-offs.
  • System schedulers for on-device inference could become simpler by keeping nearly all work on the NPU.

Load-bearing premise

Sparsely calculating attention on only a tiny portion of tokens selected via NPU pilot compute preserves model accuracy without significant degradation.

What would settle it

A direct accuracy or perplexity comparison on a standard LLM benchmark showing substantial quality loss when replacing full attention with shadowAttn at the same sparsity level.
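As a sketch of how such a comparison could be scored, the snippet below computes perplexity from per-token log-probabilities and the relative degradation between two runs. The log-probability lists are hypothetical placeholders; in a real test they would come from running the same evaluation text through the model with full attention and with shadowAttn.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def relative_degradation(ppl_full, ppl_sparse):
    """Fractional increase in perplexity of the sparse run over the full run."""
    return (ppl_sparse - ppl_full) / ppl_full

# Hypothetical per-token log-probabilities for the same four tokens.
full_run = [math.log(0.5)] * 4
sparse_run = [math.log(0.4)] * 4
print(relative_degradation(perplexity(full_run), perplexity(sparse_run)))
```

Holding the sparsity level fixed across both runs, as the text above requires, keeps the comparison from conflating quality loss with a change in compute budget.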

Figures

Figures reproduced from arXiv: 2508.16703 by Daliang Xu, Gang Huang, Mengwei Xu, Wangsong Yin, Xuanzhe Liu.

Figure 1: The workflow of static compute graph of mobile NPUs. The latency is acquired on QNN SDK [6] by a basic matrix multiplication operation.

Table text extracted with this figure:

Dataset          PhoneLM-0.5B    PhoneLM-1.5B    Qwen2-0.5B      Qwen2-1.5B
                 C/G     N       C/G     N       C/G     N       C/G     N
ArxivSum [18]    14.7    0.0     11.9    0.0     10.7    9.4     8.5     9.1
DroidCall [61]   27.5    20.5    20.5    19.0    34.5    27.5    48.0    22.5
Octopus [17]     64.6    24.1    79.2    24.7    60.6    34.8    61.2    34.2
Figure 2: The attention score skewness of LLMs. Profiled on 128 samples from WikiText-2.
Diagram labels extracted with this figure: Q, K, Attention Scores, Q, K/V, O; Estimation Stage, Attention Stage; indices of top-k values ([0, 0], [1, 1], …, [6, 1], [7, 3]).
Figure 3: The workflow of sparse attention.
Paper excerpt extracted with this figure: 2.2 Sparse Attention. The opportunity of minimizing the reliance on CPU/GPU is the highly sparse characteristic of attention operation. The attention can be highly sparse. We observe that only a small fraction of tokens in the attention mechanism are truly important. We evaluate 128 randomly sampled data points from the WikiText-2 corpus [35], analyzing two randomly selecte…
Figure 6: The importance is uneven across heads and layers. (a) Removing the heads in layer 1 of PhoneLM-0.5B; (b) removing the layers of PhoneLM-0.5B. The data is on 128 samples of WikiText-2. Loss values over 1e-3 are clamped to 1e-3. The y-axis of subfigure (b) is processed by log10.
Paper excerpt extracted with this figure: 3.2 Dynamic Sparse Attention of shadowAttn. Head-specific sparse ratio. One of shadowAttn’s insights is that the sparse ratio of att…
Figure 7: The CDF of each head’s scale factors of Q/K. Model: Qwen2-0.5B; data: 128 samples from WikiText-2. The x-axis is on a log10 scale.
Paper excerpt extracted with this figure: … minutes for a mobile LLM on a cloud server with a single A100 GPU, being affordable for most developers. Running estimation on NPU. Another key insight of shadowAttn is that the estimation can be offloaded to low-precision NPU. shadowAttn’s observation is that only determining the…
Figure 9: An illustration of NPU-CPU/GPU pipeline.
Paper excerpt extracted with this figure: … execution obeys

$$\zeta^{i}_{npu} \leftarrow \zeta^{i}_{topk};\quad \zeta^{i}_{npu},\ \zeta^{i}_{topk} \leftarrow \zeta^{i}_{qkv},\quad \forall i \in n, \tag{4}$$

where “←” means the dependency. The naive way is running each operation sequentially (Figure 9(1)). However, this ignores several key optimizations in this procedure. shadowAttn further introduces the following insights. Overlapping. Both $\zeta^{i}_{npu}$ and $\zeta^{i}_{qkv}$ operations are compute-bound, …
Figure 10: Under the same highly limited CPU/GPU resources, shadowAttn can achieve much lower attention kernel latency compared to other baselines on MI14.
Extracted plot text: y-axis Inference Latency (Minute); bar labels per panel, in extraction order: PhoneLM-0.5B (ArxivSum): 2.1, 1.8, 0.9, 1.1, 1.7; PhoneLM-0.5B (DroidCall): 2.1, 2.0, 1.0, 1.2, 1.9; PhoneLM-0.5B (Octopus): 0.9, 0.8, 0.2, 0.4, 0.6; PhoneLM…: 3.4, 2.6, 1.4, 1.2, 1.9.
Figure 11: Under the same highly limited CPU/GPU resources, shadowAttn can achieve much lower end-to-end average inference latency on datasets of real-world mobile tasks compared to other baselines on MI14.
Extracted plot text: y-axis Inference Latency (Minute); bar labels per panel, in extraction order: PhoneLM-0.5B (ArxivSum): 0.5, 0.6, 0.2; PhoneLM-0.5B (DroidCall): 1.2, 1.3, 0.7; PhoneLM-0.5B (Octopus): 1.3, 1.4, 0.8; …
Figure 12: Compared to the native attention in SoTA NPU inference framework that shows high reliance on CPU/GPU, shadowAttn achieves on-par or lower latency with significantly fewer CPU/GPU resources. Device: MI14.

Table text extracted with this figure:

Model          Dataset     C/G-Full   C/G-Sparse   C/G-Block-Sparse   NPU-Full   Ours
PhoneLM-0.5B   ArxivSum    14.7       14.9         10.0               0.0        15.2
PhoneLM-0.5B   DroidCall   27.5       24.0         25.5               20.5       25.5
PhoneLM-0.5B   Octopus     64.6       71.3         62.9               24.1       64.0
PhoneLM-1.5B   ArxivSum    …
Figure 14: Sensitivity analysis of scale factor buckets.
Extracted plot text: panels Octopus, ArxivSum, DroidCall; x-axis Prime Core, Middle Core, Small Core; y-axis Inference Latency (Minute); series C/GPU-Full, Ours.
Figure 15: Varying the available resource of CPU/GPU.
Extracted plot text: configurations w/o C/GPU, w/o head-ratio, w/o buckets, w/o pipeline, Ours; Accuracy labels 34.8, 52.8, 60.1, 61.2, 61.2; Latency (Min.) labels 0.60, 0.31, 0.31, 0.31, 0.20.
Figure 16: Ablation study on Qwen2-0.5B, MI14, Octopus.
Paper excerpt extracted with this figure: 5.2 Sensitivity Analysis and Ablation Study. Global sparsity ratio. We show the sensitivity of sparsity ratio in…
Original abstract

On-device running Large Language Models (LLMs) is nowadays a critical enabler towards preserving user privacy. We observe that the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of quantization sensitivity in state-of-the-art frameworks. This fallback results in a degraded user experience and increased complexity in system scheduling. To this end, this paper presents shadowAttn, a system-algorithm codesigned sparse attention module with minimal reliance on CPU/GPU by only sparsely calculating the attention on a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens with a NPU-based pilot compute. Further, shadowAttn proposes insightful techniques such as NPU compute graph bucketing, head-wise NPU-CPU/GPU pipeline and per-head fine-grained sparsity ratio to achieve high accuracy and efficiency. shadowAttn delivers the best performance with highly limited CPU/GPU resource; it requires much less CPU/GPU resource to deliver on-par performance of SoTA frameworks.
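To illustrate why a head-wise NPU-CPU/GPU pipeline can help, the sketch below compares a fully sequential schedule with one that overlaps the two processors. It assumes, as a simplification, that each head has one NPU stage (token estimation) followed by one CPU/GPU stage (top-k selection plus sparse attention), and that heads are processed in a fixed order. The durations are made-up numbers, not measurements from the paper.

```python
def sequential_makespan(npu_ms, cpu_ms):
    """Total time when every stage of every head runs one after another."""
    return sum(npu_ms) + sum(cpu_ms)

def pipelined_makespan(npu_ms, cpu_ms):
    """Total time when the NPU works on later heads while the CPU/GPU
    finishes earlier ones (two-stage pipeline, fixed head order)."""
    npu_done = 0.0
    cpu_done = 0.0
    for a, b in zip(npu_ms, cpu_ms):
        npu_done += a                            # NPU runs heads back to back
        cpu_done = max(cpu_done, npu_done) + b   # CPU/GPU waits for this head's estimate
    return cpu_done

# Hypothetical per-head stage durations in milliseconds.
npu_ms = [2, 2, 2, 2]
cpu_ms = [3, 1, 2, 2]
print(sequential_makespan(npu_ms, cpu_ms))  # 16
print(pipelined_makespan(npu_ms, cpu_ms))   # 10.0
```

In this toy case the pipelined schedule hides most of the NPU time behind CPU/GPU work, which is the effect the abstract attributes to hiding estimation overhead.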

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents ShadowNPU, a system-algorithm co-design for NPU-centric on-device LLM inference. It introduces shadowAttn, a sparse attention module that hides token-importance estimation overhead inside an NPU-based pilot compute, then performs attention only on a tiny selected token subset. Additional techniques include NPU compute-graph bucketing, head-wise NPU-CPU/GPU pipelining, and per-head fine-grained sparsity ratios. The central claim is that shadowAttn achieves on-par accuracy with state-of-the-art frameworks while requiring substantially less CPU/GPU resource.

Significance. If the accuracy-preservation claim holds under realistic workloads, the work would be a practical advance for on-device LLMs: it directly attacks the quantization-induced fallback of attention to CPU/GPU, which currently degrades latency and complicates scheduling. The co-design emphasis and explicit handling of NPU graph constraints are strengths that could influence future hardware-software interfaces for quantized inference.

major comments (2)
  1. [Abstract, §4] Abstract and §4 (evaluation): the central claim that shadowAttn 'delivers the best performance with highly limited CPU/GPU resource' and 'on-par performance of SoTA frameworks' is stated without any quantitative numbers, error bars, or ablation tables in the abstract and is only weakly supported in the evaluation description. Without measured accuracy deltas, latency breakdowns, or resource-usage figures, the load-bearing performance assertion cannot be assessed.
  2. [§3.2] §3.2 (shadowAttn pilot): the assumption that NPU-pilot-selected sparse attention preserves accuracy is load-bearing yet unsupported by concrete evidence. The manuscript does not report the pilot's approximation error relative to full attention, nor does it show token-selection stability across context lengths or model scales; if the pilot misses high-attention tokens, the 'on-par' result collapses even if CPU/GPU usage drops.
minor comments (2)
  1. [Figure 9] Figure 9 (pipeline diagram): the head-wise NPU-CPU/GPU pipeline is difficult to follow because the figure lacks explicit timing annotations or resource-occupancy bars; adding these would clarify the claimed overlap benefit.
  2. [§3.2] §3.2: the per-head fine-grained sparsity ratio is introduced without a precise formula or pseudocode; a short algorithmic listing would remove ambiguity about how the ratio is computed and applied at runtime.
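For illustration only, here is one way a per-head ratio could be made concrete. It is not the paper's formula: it assumes each head has an offline importance score (for example, the loss increase when that head is removed) and splits a global keep ratio across heads in proportion to that score. `per_head_keep_ratios`, `tokens_to_keep`, and the `floor` parameter are hypothetical.

```python
def per_head_keep_ratios(importance, global_ratio, floor=0.01):
    """Split a global keep ratio across heads in proportion to importance.

    Before clamping, the mean of the returned ratios equals global_ratio.
    Each ratio is then clamped to the range [floor, 1.0].
    """
    n = len(importance)
    total = sum(importance)
    if total <= 0:
        return [global_ratio] * n
    raw = [global_ratio * n * w / total for w in importance]
    return [min(1.0, max(floor, r)) for r in raw]

def tokens_to_keep(ratios, seq_len):
    """Convert keep ratios into per-head token counts (at least one token)."""
    return [max(1, round(r * seq_len)) for r in ratios]

# Hypothetical importance scores for four heads, global keep ratio of 20%.
ratios = per_head_keep_ratios([4.0, 1.0, 1.0, 2.0], 0.2)
print(tokens_to_keep(ratios, 100))  # [40, 10, 10, 20]
```

A proportional split is the simplest choice; any scheme that assigns larger budgets to more sensitive heads would fit the description in the report.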

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the quantitative support for our claims and providing additional evidence for accuracy preservation. We address each major comment below and have made revisions to the manuscript to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (evaluation): the central claim that shadowAttn 'delivers the best performance with highly limited CPU/GPU resource' and 'on-par performance of SoTA frameworks' is stated without any quantitative numbers, error bars, or ablation tables in the abstract and is only weakly supported in the evaluation description. Without measured accuracy deltas, latency breakdowns, or resource-usage figures, the load-bearing performance assertion cannot be assessed.

    Authors: We agree that the abstract would benefit from explicit quantitative results to make the performance claims more concrete and assessable. In the revised manuscript we have updated the abstract to include specific metrics such as CPU/GPU resource reduction percentages and accuracy deltas relative to SoTA frameworks. For §4, the original evaluation already contains direct comparisons, but we have expanded it with additional tables reporting accuracy deltas, per-component latency breakdowns, resource-usage figures, and error bars derived from repeated runs to more robustly substantiate the central claims.
    Revision: yes

  2. Referee: [§3.2] §3.2 (shadowAttn pilot): the assumption that NPU-pilot-selected sparse attention preserves accuracy is load-bearing yet unsupported by concrete evidence. The manuscript does not report the pilot's approximation error relative to full attention, nor does it show token-selection stability across context lengths or model scales; if the pilot misses high-attention tokens, the 'on-par' result collapses even if CPU/GPU usage drops.

    Authors: We acknowledge the value of explicit quantification for the pilot's fidelity. In the revised §3.2 we have added a new analysis subsection that reports the pilot's approximation error (measured as the L1 difference in attention scores versus full attention) and includes experiments demonstrating token-selection stability across multiple context lengths and model scales. These additions show that the selected token subsets reliably capture high-attention tokens, thereby supporting the observed accuracy preservation while still reducing CPU/GPU fallback.
    Revision: yes
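The two quantities named in this exchange, the L1 difference in attention scores and whether the pilot captures the high-attention tokens, can be sketched as follows. The logits are hypothetical; `pilot_logits` simulates a noisy low-precision estimate of `full_logits` and does not come from the paper.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def l1_error(p, q):
    """Sum of absolute differences between two attention distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def topk_indices(scores, k):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def topk_recall(full_scores, pilot_scores, k):
    """Fraction of the true top-k tokens that the pilot also selects."""
    truth = topk_indices(full_scores, k)
    picked = topk_indices(pilot_scores, k)
    return len(truth & picked) / k

# Hypothetical attention logits for one query over five tokens.
full_logits = [2.0, 0.1, -1.0, 1.5, 0.3]
pilot_logits = [1.9, 0.4, -0.8, 1.4, 0.2]
print(l1_error(softmax(full_logits), softmax(pilot_logits)))
print(topk_recall(full_logits, pilot_logits, k=2))  # 1.0
```

Recall is the more direct check of the referee's concern, because a pilot can have a noticeable L1 error and still rank the same tokens at the top.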

Circularity Check

0 steps flagged

No significant circularity in engineering co-design

full rationale

The paper presents shadowAttn as a practical system-algorithm co-design for NPU-centric sparse attention, relying on techniques such as NPU pilot compute for token selection, graph bucketing, head-wise pipelining, and per-head sparsity ratios. These are described as engineering choices validated through implementation and benchmarking rather than derived quantities. No equations, predictions, or first-principles results reduce by construction to fitted inputs or self-referential definitions. Claims of on-par accuracy and reduced CPU/GPU usage are framed as empirical outcomes, not tautological re-statements of the design itself. Self-citations, if present, are not load-bearing for any central mathematical result. The derivation chain is self-contained as an applied systems contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical systems contribution with no mathematical derivations, free parameters, or new physical entities; it relies on standard assumptions about attention sparsity and NPU hardware capabilities.

pith-pipeline@v0.9.0 · 5724 in / 1196 out tokens · 28303 ms · 2026-05-18T21:59:51.790801+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

    cs.LG 2026-04 unverdicted novelty 6.0

    NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...

  2. EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices

    cs.OS 2026-04 unverdicted novelty 6.0

    EdgeFlow reduces mobile LLM cold-start latency up to 4.07x versus llama.cpp, MNN, and llm.npu by NPU-aware adaptive quantization, SIMD-friendly packing, and synergistic granular CPU-NPU pipelining at comparable accuracy.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · cited by 2 Pith papers · 18 internal anchors

  1. [1]

    ARM NEON

    2025. ARM NEON. https://www.arm.com/technologies/neon

  2. [2]

    Hexagon NPU SDK

    2025. Hexagon NPU SDK. https://www.qualcomm.com/developer/software/hexagon-npu-sdk

  3. [3]

    2025. LLVM. https://llvm.org/

  4. [4]

    Nvidia Jetson Orin

    2025. Nvidia Jetson Orin. https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/

  5. [5]

    2025. OpenCL. https://en.wikipedia.org/wiki/OpenCL

  6. [6]

    2025. QNN SDK. https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/introduction.html

  7. [7]

    Qualcomm Neural Processing Engine

    2025. Qualcomm Neural Processing Engine. https://docs.qualcomm.com/bundle/publicresource/topics/80-70015-15BY/snpe.html

  8. [8]

    2025. rewind. https://www.rewind.ai/

  9. [9]

    Snapdragon 8 gen 3 mobile platform product brief

    2025. Snapdragon 8 gen 3 mobile platform product brief. https://docs.qualcomm.com/bundle/publicresource/87-71408-1_REV_C_Snapdragon_8_gen_3_Mobile_Platform_Product_Brief.pdf

  10. [10]

    TMS320F2812 platform product brief

    2025. TMS320F2812 platform product brief. https://www.ti.com/product/TMS320F2812

  11. [11]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Jyoti Aneja, et al. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219 [cs.CL] https://arxiv.org/abs/2404.14219

  12. [12]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

  13. [13]

    Ozan Baris, Yizhuo Chen, Gaofeng Dong, Liying Han, Tomoyoshi Kimura, Pengrui Quan, Ruijie Wang, Tianchen Wang, Tarek Abdelzaher, Mario Bergés, Paul Pu Liang, and Mani Srivastava. 2025. Foundation Models for CPS-IoT: Opportunities and Challenges. arXiv:2501.16368 [cs.LG] https://arxiv.org/abs/2501.16368

  14. [14]

    Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov

  15. [15]

    Small Language Models are the Future of Agentic AI

    Small Language Models are the Future of Agentic AI. arXiv:2506.02153 [cs.AI] https://arxiv.org/abs/2506.02153

  16. [16]

    Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. 2025. A Survey on Mixture of Experts in Large Language Models. IEEE Transactions on Knowledge and Data Engineering (2025), 1–20. doi: 10.1109/tkde.2025.3554028

  17. [17]

    Le Chen, Dahu Feng, Erhu Feng, Rong Zhao, Yingrui Wang, Yubin Xia, Haibo Chen, and Pinjie Xu. 2025. HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs platform with Heterogeneous AI Accelerators. arXiv:2501.14794 [cs.DC] https://arxiv.org/abs/2501.14794

  18. [18]

    Wei Chen and Zhiyuan Li. 2024. Octopus v2: On-device language model for super agent. arXiv:2404.01744 [cs.CL] https://arxiv.org/abs/2404.01744

  19. [19]

    Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2...

  20. [20]

    Ziyan Fu, Ju Ren, Deyu Zhang, Yuezhi Zhou, and Yaoxue Zhang. 2022. Kalmia: A Heterogeneous QoS-aware Scheduling Framework for DNN Tasks on Edge Servers. In IEEE INFOCOM 2022 - IEEE Conference on Computer Communications. 780–789. doi: 10.1109/INFOCOM48880.2022.9796661

  21. [21]

    Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. 2024. Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. arXiv:2310.01801 [cs.CL] https://arxiv.org/abs/2310.01801

  22. [22]

    ggml. 2025. llama.cpp. https://github.com/ggml-org/llama.cpp

  23. [23]

    Joo Seong Jeong, Jingyu Lee, Donghyun Kim, Changmin Jeon, Changjin Jeong, Youngki Lee, and Byung-Gon Chun. 2022. Band: coordinated multi-DNN inference on heterogeneous mobile processors. In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services (Portland, Oregon) (MobiSys ’22). Association for Computing Mach...

  24. [24]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.068...

  25. [25]

    MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention. arXiv:2407.02490 [cs.CL] https://arxiv.org/abs/2407.02490

  26. [26]

    Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. 2025. FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference. arXiv:2502.20766 [cs.LG] https://arxiv.org/abs/2502.20766

  27. [27]

    Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation

    Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun Choi, Steven Y. Ko, Sangeun Oh, and Insik Shin. 2024. Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation. arXiv:2312.03003 [cs.HC] https://arxiv.org/abs/2312.03003

  28. [28]

    Liang Li, Xingke Yang, Wen Wu, Hao Wang, Tomoaki Ohtsuki, Xin Fu, Miao Pan, and Xuemin Shen. 2025. MobiLLM: Enabling LLM Fine-Tuning on the Mobile Device via Server Assisted Side Tuning. arXiv:2502.20421 [cs.LG] https://arxiv.org/abs/2502.20421

  29. [29]

    Xiang Li, Zhenyan Lu, Dongqi Cai, Xiao Ma, and Mengwei Xu

  30. [30]

    Large Language Models on Mobile Devices: Measurements, Analysis, and Insights

    Large Language Models on Mobile Devices: Measurements, Analysis, and Insights. In Proceedings of the Workshop on Edge and Mobile Foundation Models (Minato-ku, Tokyo, Japan) (EdgeFM ’24). Association for Computing Machinery, New York, NY, USA, 1–6. doi:10.1145/3662006.3662059

  31. [31]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978 [cs.CL] https://arxiv.org/abs/2306.00978

  32. [32]

    Kaiwei Liu, Bufang Yang, Lilin Xu, Yunqi Guo, Guoliang Xing, Xian Shuai, Xiaozhe Ren, Xin Jiang, and Zhenyu Yan. 2025. TaskSense: A Translation-like Approach for Tasking Heterogeneous Sensor Systems with LLMs. Association for Computing Machinery, New York, NY, USA, 213–225. https://doi.org/10.1145/3715014.3722070

  33. [33]

    Mukul Lokhande, Gopal Raut, and Santosh Kumar Vishvakarma. 2024. Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads. arXiv:2412.11702 [cs.AR] https://arxiv.org/abs/2412.11702

  34. [34]

    Mukul Lokhande and Santosh Kumar Vishvakarma. 2025. POLARON: Precision-aware On-device Learning and Adaptive Runtime-cONfigurable AI acceleration. arXiv:2506.08785 [cs.AR] https://arxiv.org/abs/2506.08785

  35. [35]

    MoBA: Mixture of Block Attention for Long-Context LLMs

    Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, and Jiezhong Qiu. 2025. MoBA: Mixture of Block Attention for Long...

  36. [36]

    Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei

  37. [37]

    The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

    The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv:2402.17764 [cs.CL] https://arxiv.org/abs/2402.17764

  38. [38]

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher

  39. [39]

    Pointer Sentinel Mixture Models

    Pointer Sentinel Mixture Models. arXiv:1609.07843 [cs.CL]

  40. [40]

    Explainable artificial intelligence: A survey of needs, techniques, applications, and future direction

    Melkamu Mersha, Khang Lam, Joseph Wood, Ali K. AlShami, and Jugal Kalita. 2024. Explainable artificial intelligence: A survey of needs, techniques, applications, and future direction. Neurocomputing 599 (Sept. 2024), 128111. doi: 10.1016/j.neucom.2024.128111

  41. [41]

    MLC-LLM

    MLC team. 2023-2025. MLC-LLM. https://github.com/mlc-ai/mlc-llm

  42. [42]

    Mozhgan Navardi, Romina Aalishah, Yuzhe Fu, Yueqian Lin, Hai Li, Yiran Chen, and Tinoosh Mohsenin. 2025. GenAI at the Edge: Comprehensive Survey on Empowering Edge Devices. arXiv:2502.15816 [cs.DC] https://arxiv.org/abs/2502.15816

  43. [43]

    Xiaomin Ouyang, Xian Shuai, Yang Li, Li Pan, Xifan Zhang, Heming Fu, Sitong Cheng, Xinyan Wang, Shihua Cao, Jiang Xin, Hazel Mok, Zhenyu Yan, Doris Sau Fung Yu, Timothy Kwok, and Guoliang Xing. 2024. ADMarker: A Multi-Modal Federated Learning System for Monitoring Digital Biomarkers of Alzheimer’s Disease. arXiv:2310.15301 [cs.LG] https://arxiv.org/ab...

  44. [44]

    Jun-Seok Park, Changsoo Park, Suknam Kwon, Taeho Jeon, Yesung Kang, Heonsoo Lee, Dongwoo Lee, James Kim, Hyeong-Seok Kim, YoungJong Lee, Sangkyu Park, MinSeong Kim, SangHyuck Ha, Jihoon Bang, Jinpyo Park, SukHwan Lim, and Inyup Kang. 2023. A Multi-Mode 8k-MAC HW-Utilization-Aware Neural Processing Unit With a Unified Multi-Precision Datapath in 4-nm Flag...

  45. [45]

    Dan Peng, Zhihui Fu, and Jun Wang. 2024. PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs. arXiv:2407.01031 [cs.LG] https://arxiv.org/abs/2407.01031

  46. [46]

    phonelm. 2025. PhoneLM-0.5B. https://huggingface.co/unsloth/PhoneLM-0.5B

  47. [47]

    phonelm. 2025. PhoneLM-1.5B. https://huggingface.co/unsloth/PhoneLM-1.5B

  48. [48]

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  49. [49]

    qwen. 2025. Qwen2-0.5B. https://huggingface.co/unsloth/Qwen2-0.5B

  50. [50]

    qwen. 2025. Qwen2-1.5B. https://huggingface.co/unsloth/Qwen2-1.5B

  51. [51]

    redmi. 2025. Redmi K60 Champion Edition Smartphone. https://www.gsmarena.com/xiaomi_redmi_k60_pro-12046.php

  52. [52]

    Tanmoy Sen, Haiying Shen, and Anand Padmanabha Iyer. 2025. Flex: Fast, Accurate DNN Inference on Low-Cost Edges Using Heterogeneous Accelerator Execution. In Proceedings of the Twentieth European Conference on Computer Systems (Rotterdam, Netherlands) (EuroSys ’25). Association for Computing Machinery, New York, NY, USA, 507–523. doi: 10.1145/368903...

  53. [53]

    Andrii Skliar, Ties van Rozendaal, Romain Lepert, Todor Boinovski, Mart van Baalen, Markus Nagel, Paul Whatmough, and Babak Ehteshami Bejnordi. 2025. Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference. arXiv:2412.00099 [cs.LG] https://arxiv.org/abs/2412.00099

  54. [54]

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2023. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864 [cs.CL] https://arxiv.org/abs/2104.09864

  55. [55]

    Shreyas Subramanian, Vikram Elango, and Mecit Gungor. 2025. Small Language Models (SLMs) Can Still Pack a Punch: A survey. arXiv:2501.05465 [cs.CL] https://arxiv.org/abs/2501.05465

  56. [56]

    Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. 2025. Towards End-to-End Optimization of LLM-based Applications with Ayo. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Rotterdam, Netherlands) (ASPLOS ’25). Association for Computing Machinery, New York, NY, USA, 13...

  57. [57]

    Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. arXiv:2406.10774 [cs.CL] https://arxiv.org/abs/2406.10774

  58. [58]

    TFLite team. 2025. mediapipe. https://ai.google.dev/edge/mediapipe/solutions/guide

  59. [59]

    Mobillama: Towards accurate and lightweight fully transparent gpt

    Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, and Fahad Shahbaz Khan. 2024. MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT. arXiv:2402.16840 [cs.CL] https://arxiv.org/abs/2402.16840

  60. [60]

    Hanrui Wang, Zhekai Zhang, and Song Han. 2021. SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. HPCA (2021)

  61. [61]

    Jianyu Wei, Ting Cao, Shijie Cao, Shiqi Jiang, Shaowei Fu, Mao Yang, Yanyong Zhang, and Yunxin Liu. 2023. NN-Stretch: Automatic Neural Network Branching for Parallel Inference on Heterogeneous Multi-Processors. In Proceedings of the 21st Annual International Conference on Mobile Systems, Applications and Services (Helsinki, Finland) (MobiSys ’23). Asso...

  62. [62]

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2024. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. arXiv:2211.10438 [cs.CL] https://arxiv.org/abs/2211.10438

  63. [63]

    Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. 2024. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. arXiv:2410.10819 [cs.CL] https://arxiv.org/abs/2410.10819

  64. [64]

    Xiaomi. 2025. MI14 Smartphone. https://www.mi.com/global/product/xiaomi-14/specs/

  65. [65]

    Weikai Xie, Li Zhang, Shihe Wang, Rongjie Yi, and Mengwei Xu. 2024. DroidCall: A Dataset for LLM-powered Android Intent Invocation. arXiv:2412.00402 [cs.AI] https://arxiv.org/abs/2412.00402

  66. [66]

    Daliang Xu, Wangsong Yin, Hao Zhang, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu. 2025. EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding. IEEE Transactions on Mobile Computing 24, 4 (2025), 3256–3273. doi: 10.1109/TMC.2024.3513457

  67. [67]

    Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, and Xuanzhe Liu. 2025. Fast On-device LLM Inference with NPUs. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (Rotterdam, Netherlands) (ASPLOS ’25). Association for Computing Machinery, Ne...

  68. [68]

    Mengwei Xu, Dongqi Cai, Wangsong Yin, Shangguang Wang, Xin Jin, and Xuanzhe Liu. 2025. Resource-efficient Algorithms and Systems of Foundation Models: A Survey. ACM Comput. Surv. 57, 5, Article 110 (Jan. 2025), 39 pages. doi: 10.1145/3706418

  69. [69]

    Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, and Xuanzhe Liu. 2024. A Survey of Resource-efficient LLM and Multimodal Foundation Models. arXiv:2401.08092 [cs.LG] https://arxiv.org/abs/2401.08092

  70. [70]

    Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. 2025. XAttention: Block Sparse Attention with Antidiagonal Scoring. arXiv:2503.16428 [cs.CL] https://arxiv.org/abs/2503.16428

  71. [71]

    Zhenliang Xue, Yixin Song, Zeyu Mi, Xinrui Zheng, Yubin Xia, and Haibo Chen. 2024. PowerInfer-2: Fast Large Language Model Inference on a Smartphone. arXiv:2406.06282 [cs.LG] https://arxiv.org/abs/2406.06282

  72. [72]

    Bufang Yang, Lilin Xu, Liekang Zeng, Kaiwei Liu, Siyang Jiang, Wenrui Lu, Hongkai Chen, Xiaofan Jiang, Guoliang Xing, and Zhenyu Yan. 2025. ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions. arXiv:2505.14668 [cs.AI] https://arxiv.org/abs/2505.14668

  74. [74]

    Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, and Song Han. 2025. LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention. arXiv:2502.14866 [cs.CL] https://arxiv.org/abs/2502.14866

  75. [75]

    Juheon Yi and Youngki Lee. 2020. Heimdall: Mobile GPU Coordination Platform for Augmented Reality Applications. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking (London, United Kingdom) (MobiCom ’20). Association for Computing Machinery, New York, NY, USA, Article 35, 14 pages. doi: 10.1145/3372224.3419192

  76. [76]

    Wangsong Yin, Daliang Xu, Gang Huang, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu. 2024. PieBridge: Fast and Parameter-Efficient On-Device Training via Proxy Networks. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (Hangzhou, China) (SenSys ’24). Association for Computing Machinery, New York, NY, USA, 126–140. doi:...

  77. [77]

    Wangsong Yin, Mengwei Xu, Yuanchun Li, and Xuanzhe Liu. 2024. LLM as a System Service on Mobile Devices. arXiv:2403.11805 [cs.OS] https://arxiv.org/abs/2403.11805

  78. [78]

    Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, and Xuanzhe Liu. 2024. ELMS: Elasticized Large Language Models On Mobile Devices. arXiv:2409.09071 [cs.DC] https://arxiv.org/abs/2409.09071

  79. [79]

    Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, and Wangding Zeng. 2025. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. arXiv:2502.11089 [cs.CL] https://arxiv.org/abs/2502.11089

  80. [80]

    Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. 2025. SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference. In International Conference on Machine Learning (ICML)
