ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

Daliang Xu; Gang Huang; Mengwei Xu; Wangsong Yin; Xuanzhe Liu

arxiv: 2508.16703 · v4 · submitted 2025-08-22 · 💻 cs.PF · cs.AI · cs.LG

Pith reviewed 2026-05-18 21:59 UTC · model grok-4.3

classification 💻 cs.PF · cs.AI · cs.LG

keywords: shadowAttn · sparse attention · on-device LLM · NPU inference · quantization sensitivity · system-algorithm co-design · attention fallback

The pith

shadowAttn keeps attention on the NPU for on-device LLMs by using pilot compute to sparsely process only important tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that the attention operator in LLMs can remain on specialized NPU hardware instead of falling back to CPU or GPU due to quantization sensitivity. It does so through shadowAttn, a co-designed sparse attention module that estimates key tokens with a small NPU pilot computation and then computes attention only on those tokens. Additional techniques such as NPU compute graph bucketing, head-wise pipelining between NPU and CPU/GPU, and per-head sparsity ratios help maintain accuracy while improving efficiency. If this holds, on-device LLM inference becomes faster and simpler to schedule with far less dependence on general-purpose processors. A reader would care because current frameworks suffer degraded performance and added system complexity from the fallback, limiting practical privacy-preserving AI on phones and edge devices.
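The two-stage idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the int8 quantization in `quantize_int8` stands in for the NPU pilot compute, and every function name and the toy data are assumptions made for the example.

```python
import math

def quantize_int8(vec):
    """Symmetric per-vector int8 quantization; returns (integers, scale)."""
    max_abs = max(abs(x) for x in vec) or 1.0
    scale = max_abs / 127.0
    return [round(x / scale) for x in vec], scale

def pilot_scores(q, keys):
    """Low-precision estimate of q.k per key (hypothetical stand-in for the NPU pilot)."""
    q_int, _ = quantize_int8(q)  # q's scale is shared by all keys, so ranking can ignore it
    scores = []
    for k in keys:
        k_int, k_scale = quantize_int8(k)
        scores.append(sum(a * b for a, b in zip(q_int, k_int)) * k_scale)
    return scores

def sparse_attention(q, keys, values, top_k):
    """Attend only to the top_k keys ranked by the pilot, in full precision."""
    est = pilot_scores(q, keys)
    ranked = sorted(range(len(keys)), key=lambda i: est[i], reverse=True)
    keep = sorted(ranked[:top_k])
    logits = [sum(a * b for a, b in zip(q, keys[i])) / math.sqrt(len(q)) for i in keep]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [sum(w / total * values[i][c] for w, i in zip(weights, keep))
           for c in range(len(values[0]))]
    return keep, out

# Toy data (illustrative only): one query, four keys/values of dimension 2.
q = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [3.0, 1.0], [-2.0, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
kept, output = sparse_attention(q, keys, values, top_k=2)
```

Only the ordering of the pilot scores matters for selection, which is why a coarse low-precision estimate can be enough in this sketch; the softmax itself is computed in floating point on the kept tokens.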

Core claim

shadowAttn is a system-algorithm co-designed sparse attention module that minimizes reliance on CPU/GPU by computing attention sparsely, on only a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens behind an NPU-based pilot compute. shadowAttn further proposes NPU compute graph bucketing, a head-wise NPU-CPU/GPU pipeline, and a per-head fine-grained sparsity ratio to achieve high accuracy and efficiency. It delivers the best performance when CPU/GPU resources are highly limited, and it needs far fewer CPU/GPU resources to match the performance of SoTA frameworks.

What carries the argument

shadowAttn, a sparse attention module that hides token-importance estimation overhead via NPU pilot compute and applies graph bucketing, head-wise pipelining, and per-head sparsity to limit CPU/GPU use.
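One way to picture graph bucketing is as a lookup from a runtime quantization scale factor to one of a small set of pre-built static graphs. The sketch below is hypothetical throughout: it assumes buckets laid out on a log10 grid, and the grid origin `lo_exp`, the `step` of 0.5, and the bucket count are illustrative values, not parameters confirmed by the paper.

```python
import math

def bucket_index(scale, lo_exp=-2.0, step=0.5, n_buckets=9):
    """Map a positive scale factor to the nearest bucket on a log10 grid (assumed layout)."""
    raw = round((math.log10(scale) - lo_exp) / step)
    return min(n_buckets - 1, max(0, raw))  # clamp out-of-range scales to the edge buckets

def bucket_scale(index, lo_exp=-2.0, step=0.5):
    """Representative scale factor baked into the static graph for a bucket."""
    return 10 ** (lo_exp + index * step)

# A hypothetical cache: one pre-built graph handle per bucket, selected at runtime.
graph_cache = {i: f"graph_for_scale_{bucket_scale(i):.4g}" for i in range(9)}
chosen = graph_cache[bucket_index(0.03)]
```

The trade-off in such a scheme is between the number of graphs kept resident and the error from snapping a scale factor to its bucket's representative value.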

If this is right

shadowAttn delivers the best performance among the compared baselines when CPU/GPU resources are highly limited.
It needs far fewer CPU/GPU resources to match the performance of SoTA frameworks.
NPU compute graph bucketing and head-wise pipeline support both accuracy and efficiency.
Per-head fine-grained sparsity ratio allows tailored trade-offs across attention heads.
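A per-head sparsity ratio can be illustrated as splitting a global token budget across heads in proportion to how much each head matters. The allocation rule below is an assumption for illustration only (the paper's actual rule is not given in this summary), and the importance scores are made-up numbers.

```python
def per_head_keep_ratios(importance, global_ratio, min_ratio=0.01, max_ratio=1.0):
    """Split a global keep ratio across heads in proportion to importance (hypothetical rule)."""
    total = sum(importance)
    n_heads = len(importance)
    if total <= 0:
        return [global_ratio] * n_heads
    # Without clamping, the mean of the returned ratios equals global_ratio.
    ratios = [global_ratio * n_heads * imp / total for imp in importance]
    return [min(max_ratio, max(min_ratio, r)) for r in ratios]

def tokens_to_keep(ratios, seq_len):
    """Convert per-head ratios into per-head token counts, keeping at least one token."""
    return [max(1, round(r * seq_len)) for r in ratios]

# Illustrative importance scores for four heads (e.g. loss increase when a head is removed).
ratios = per_head_keep_ratios([4.0, 2.0, 1.0, 1.0], global_ratio=0.2)
budget = tokens_to_keep(ratios, seq_len=100)
```

Heads with higher importance keep more tokens, while the average budget stays at the global ratio unless the clamps engage.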

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pilot-compute approach for token selection could apply to other quantization-sensitive operators beyond attention.
Devices with very constrained CPU or GPU cores might now support larger models without major accuracy trade-offs.
System schedulers for on-device inference could become simpler by keeping nearly all work on the NPU.

Load-bearing premise

Sparsely calculating attention on only a tiny portion of tokens selected via NPU pilot compute preserves model accuracy without significant degradation.
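Whether this premise holds depends on how skewed the attention distribution is. As a rough illustration with made-up logits (not data from the paper), the fraction of softmax mass captured by the top-k tokens can be computed directly:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def topk_mass(logits, k):
    """Fraction of attention probability held by the k largest entries."""
    probs = sorted(softmax(logits), reverse=True)
    return sum(probs[:k])

skewed = [8.0, 6.0, 1.0, 0.5, 0.0, -1.0, -2.0, -3.0]   # illustrative skewed scores
uniform = [0.0] * 8                                     # no skew at all
skewed_mass = topk_mass(skewed, 2)
uniform_mass = topk_mass(uniform, 2)
```

With the skewed logits, two of eight tokens carry more than 99% of the mass, whereas with uniform logits they carry only a quarter, so dropping the rest would discard most of the signal.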

What would settle it

A direct accuracy or perplexity comparison on a standard LLM benchmark showing substantial quality loss when replacing full attention with shadowAttn at the same sparsity level.
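Such a comparison could be scored as follows. This is a minimal sketch that assumes per-token log-probabilities have already been collected from a full-attention run and a sparse-attention run on the same text; the helper names and the sample numbers are illustrative, not results from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp of the mean negative log-likelihood."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_degradation(ppl_full, ppl_sparse):
    """Relative perplexity increase of the sparse variant over full attention."""
    return (ppl_sparse - ppl_full) / ppl_full

# Hypothetical log-probabilities for the same four tokens under each variant.
full_logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
sparse_logprobs = [math.log(0.5), math.log(0.2), math.log(0.5), math.log(0.25)]
delta = relative_degradation(perplexity(full_logprobs), perplexity(sparse_logprobs))
```

A large `delta` at a fixed sparsity level would count against the premise; a value near zero would support it.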

Figures

Figures reproduced from arXiv: 2508.16703 by Daliang Xu, Gang Huang, Mengwei Xu, Wangsong Yin, Xuanzhe Liu.

**Figure 1.** The workflow of static compute graph of mobile NPUs. The latency is acquired on QNN SDK [6] by a basic matrix multiplication operation.

Table data extracted alongside this caption (column labels C/G and N as they appear in the source):

| Dataset | PhoneLM-0.5B C/G | PhoneLM-0.5B N | PhoneLM-1.5B C/G | PhoneLM-1.5B N | Qwen2-0.5B C/G | Qwen2-0.5B N | Qwen2-1.5B C/G | Qwen2-1.5B N |
|---|---|---|---|---|---|---|---|---|
| ArxivSum [18] | 14.7 | 0.0 | 11.9 | 0.0 | 10.7 | 9.4 | 8.5 | 9.1 |
| DroidCall [61] | 27.5 | 20.5 | 20.5 | 19.0 | 34.5 | 27.5 | 48.0 | 22.5 |
| Octopus [17] | 64.6 | 24.1 | 79.2 | 24.7 | 60.6 | 34.8 | 61.2 | 34.2 |

**Figure 2.** The attention score skewness of LLMs. Profiled on 128 samples from WikiText-2. Diagram labels extracted alongside: Q, K, Attention Scores; Q, K/V, O; Estimation Stage; Attention Stage; Indices of top k values.

**Figure 3.** The workflow of sparse attention. Adjacent paper text (§2.2 Sparse Attention): "The opportunity of minimizing the reliance on CPU/GPU is the highly sparse characteristic of attention operation. The attention can be highly sparse. We observe that only a small fraction of tokens in the attention mechanism are truly important. We evaluate 128 randomly sampled data points from the WikiText-2 corpus [35], analyzing two randomly selecte…"

**Figure 6.** The importance is uneven across heads and layers. (a) Removing the heads in layer 1 of PhoneLM-0.5B; (b) removing the layers of PhoneLM-0.5B. The data is on 128 samples of WikiText-2. Loss values over 1e-3 are clamped to 1e-3. The y-axis of subfigure (b) is processed by log10. Adjacent paper text (§3.2 Dynamic Sparse Attention of shadowAttn): "Head-specific sparse ratio. One of shadowAttn’s insight is that the sparse ratio of att…"

**Figure 7.** The CDF of each head’s scale factors of Q/K. Model: Qwen2-0.5B; data: 128 samples from WikiText-2. The x axis is logged by 10. Adjacent paper text: "… minutes for a mobile LLM on a cloud server with a single A100 GPU, being affordable for most developers. Running estimation on NPU. Another key insight of shadowAttn is that the estimation can be offloaded to low-precision NPU. shadowAttn’s observation is that only determining the…"

**Figure 9.** An illustration of NPU-CPU/GPU pipeline. Adjacent paper text: "execution obeys $\zeta^{i}_{npu} \leftarrow \zeta^{i}_{topk};\ \zeta^{i}_{npu}, \zeta^{i}_{topk} \leftarrow \zeta^{i}_{qkv},\ \forall i \in n$ (4), where “←” means the dependency. The naive way is running each operation sequentially (Figure 9(1)). However, this ignores several key optimizations in this procedure. shadowAttn further introduces the following insights. Overlapping. Both $\zeta^{i}_{npu}$ and $\zeta^{i}_{qkv}$ operations are compute-bound, …"

**Figure 10.** With the same circumstance of highly limited CPU/GPU resources, shadowAttn can achieve much lower attention kernel latency compared to other baselines on MI14. Plot residue: axis "Inference Latency (Minute)"; panels PhoneLM-0.5B (ArxivSum), PhoneLM-0.5B (DroidCall), PhoneLM-0.5B (Octopus), PhoneLM…

**Figure 11.** With the same circumstance of highly limited CPU/GPU resources, shadowAttn can achieve much lower end-to-end average inference latency on datasets of real-world mobile tasks compared to other baselines on MI14. Plot residue: axis "Inference Latency (Minute)"; panels PhoneLM-0.5B (ArxivSum), PhoneLM-0.5B (DroidCall), PhoneLM-0.5B (Octopus), …

**Figure 12.** Compared to the native attention in SoTA NPU inference framework that shows high reliance on CPU/GPU, shadowAttn achieves on-par or lower latency with significantly fewer CPU/GPU resources. Device: MI14.

Table data extracted alongside this caption (truncated in the source):

| Model | Dataset | C/G-Full | C/G-Sparse | C/G-Block-Sparse | NPU-Full | Ours |
|---|---|---|---|---|---|---|
| PhoneLM-0.5B | ArxivSum | 14.7 | 14.9 | 10.0 | 0.0 | 15.2 |
| PhoneLM-0.5B | DroidCall | 27.5 | 24.0 | 25.5 | 20.5 | 25.5 |
| PhoneLM-0.5B | Octopus | 64.6 | 71.3 | 62.9 | 24.1 | 64.0 |
| PhoneLM-1.5B | ArxivSum | … | | | | |

**Figure 14.** Sensitivity analysis of scale factor buckets. Plot residue: axis "Inference Latency (Minute)"; panels Octopus, ArxivSum, DroidCall; x categories Prime Core, Middle Core, Small Core; series C/GPU-Full, Ours.

**Figure 15.** Varying the available resource of CPU/GPU. Plot residue, in extraction order: variants w/o C/GPU, w/o head-ratio, w/o buckets, w/o pipeline, Ours; accuracy values 34.8, 52.8, 60.1, 61.2, 61.2; latency (min.) values 0.60, 0.31, 0.31, 0.31, 0.20.

**Figure 16.** Ablation study on Qwen2-0.5B, MI14, Octopus. Adjacent paper text (§5.2 Sensitivity Analysis and Ablation Study): "Global sparsity ratio. We show the sensitivity of sparsity ratio in …"
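The benefit of the head-wise NPU-CPU/GPU pipeline named in the Figure 9 caption can be illustrated with a toy two-stage schedule. The sketch assumes that, for each head, the NPU estimation must finish before that head's top-k selection and sparse attention start on CPU/GPU, and that each processor handles one head at a time; the stage durations are made-up units, not measurements.

```python
def sequential_makespan(npu_times, cpu_times):
    """Run every stage of every head back to back on a single timeline."""
    return sum(npu_times) + sum(cpu_times)

def pipelined_makespan(npu_times, cpu_times):
    """Overlap head i+1's NPU stage with head i's CPU/GPU stage."""
    npu_free = 0  # time at which the NPU finishes its latest head
    cpu_free = 0  # time at which the CPU/GPU finishes its latest head
    for npu_t, cpu_t in zip(npu_times, cpu_times):
        npu_free += npu_t
        # The CPU/GPU stage waits for both its own previous head and this head's NPU result.
        cpu_free = max(cpu_free, npu_free) + cpu_t
    return cpu_free

npu_times = [1, 1, 1]   # hypothetical per-head estimation cost on the NPU
cpu_times = [2, 2, 2]   # hypothetical per-head top-k plus sparse attention cost on CPU/GPU
saved = sequential_makespan(npu_times, cpu_times) - pipelined_makespan(npu_times, cpu_times)
```

In this toy case the pipelined schedule takes 7 units instead of 9, because all but the first NPU stage are hidden behind CPU/GPU work.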

Original abstract

On-device running Large Language Models (LLMs) is nowadays a critical enabler towards preserving user privacy. We observe that the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of quantization sensitivity in state-of-the-art frameworks. This fallback results in a degraded user experience and increased complexity in system scheduling. To this end, this paper presents shadowAttn, a system-algorithm codesigned sparse attention module with minimal reliance on CPU/GPU by only sparsely calculating the attention on a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens with a NPU-based pilot compute. Further, shadowAttn proposes insightful techniques such as NPU compute graph bucketing, head-wise NPU-CPU/GPU pipeline and per-head fine-grained sparsity ratio to achieve high accuracy and efficiency. shadowAttn delivers the best performance with highly limited CPU/GPU resource; it requires much less CPU/GPU resource to deliver on-par performance of SoTA frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents ShadowNPU, a system-algorithm co-design for NPU-centric on-device LLM inference. It introduces shadowAttn, a sparse attention module that hides token-importance estimation overhead inside an NPU-based pilot compute, then performs attention only on a tiny selected token subset. Additional techniques include NPU compute-graph bucketing, head-wise NPU-CPU/GPU pipelining, and per-head fine-grained sparsity ratios. The central claim is that shadowAttn achieves on-par accuracy with state-of-the-art frameworks while requiring substantially less CPU/GPU resource.

Significance. If the accuracy-preservation claim holds under realistic workloads, the work would be a practical advance for on-device LLMs: it directly attacks the quantization-induced fallback of attention to CPU/GPU, which currently degrades latency and complicates scheduling. The co-design emphasis and explicit handling of NPU graph constraints are strengths that could influence future hardware-software interfaces for quantized inference.

major comments (2)

[Abstract, §4] Abstract and §4 (evaluation): the central claim that shadowAttn 'delivers the best performance with highly limited CPU/GPU resource' and 'on-par performance of SoTA frameworks' is stated without any quantitative numbers, error bars, or ablation tables in the abstract and is only weakly supported in the evaluation description. Without measured accuracy deltas, latency breakdowns, or resource-usage figures, the load-bearing performance assertion cannot be assessed.
[§3.2] §3.2 (shadowAttn pilot): the assumption that NPU-pilot-selected sparse attention preserves accuracy is load-bearing yet unsupported by concrete evidence. The manuscript does not report the pilot's approximation error relative to full attention, nor does it show token-selection stability across context lengths or model scales; if the pilot misses high-attention tokens, the 'on-par' result collapses even if CPU/GPU usage drops.

minor comments (2)

[Figure 9] Figure 9 (pipeline diagram): the head-wise NPU-CPU/GPU pipeline is difficult to follow because the figure lacks explicit timing annotations or resource-occupancy bars; adding these would clarify the claimed overlap benefit.
[§3.3] §3.3: the per-head fine-grained sparsity ratio is introduced without a precise formula or pseudocode; a short algorithmic listing would remove ambiguity about how the ratio is computed and applied at runtime.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the quantitative support for our claims and providing additional evidence for accuracy preservation. We address each major comment below and have made revisions to the manuscript to improve clarity and rigor.

Point-by-point responses

Referee: [Abstract, §4] Abstract and §4 (evaluation): the central claim that shadowAttn 'delivers the best performance with highly limited CPU/GPU resource' and 'on-par performance of SoTA frameworks' is stated without any quantitative numbers, error bars, or ablation tables in the abstract and is only weakly supported in the evaluation description. Without measured accuracy deltas, latency breakdowns, or resource-usage figures, the load-bearing performance assertion cannot be assessed.

Authors: We agree that the abstract would benefit from explicit quantitative results to make the performance claims more concrete and assessable. In the revised manuscript we have updated the abstract to include specific metrics such as CPU/GPU resource reduction percentages and accuracy deltas relative to SoTA frameworks. For §4, the original evaluation already contains direct comparisons, but we have expanded it with additional tables reporting accuracy deltas, per-component latency breakdowns, resource-usage figures, and error bars derived from repeated runs to more robustly substantiate the central claims. revision: yes
Referee: [§3.2] §3.2 (shadowAttn pilot): the assumption that NPU-pilot-selected sparse attention preserves accuracy is load-bearing yet unsupported by concrete evidence. The manuscript does not report the pilot's approximation error relative to full attention, nor does it show token-selection stability across context lengths or model scales; if the pilot misses high-attention tokens, the 'on-par' result collapses even if CPU/GPU usage drops.

Authors: We acknowledge the value of explicit quantification for the pilot's fidelity. In the revised §3.2 we have added a new analysis subsection that reports the pilot's approximation error (measured as the L1 difference in attention scores versus full attention) and includes experiments demonstrating token-selection stability across multiple context lengths and model scales. These additions show that the selected token subsets reliably capture high-attention tokens, thereby supporting the observed accuracy preservation while still reducing CPU/GPU fallback. revision: yes
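The two measurements named in this response can be sketched as simple metrics. The helpers below are illustrative assumptions (the manuscript's exact definitions are not given here): an L1 difference between a full-attention score distribution and a pilot estimate, and the fraction of the true top-k tokens that the pilot also selects.

```python
def l1_distance(full_scores, pilot_scores):
    """Sum of absolute differences between two score vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(full_scores, pilot_scores))

def topk_indices(scores, k):
    """Indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def topk_recall(full_scores, pilot_scores, k):
    """Share of the true top-k tokens that the pilot estimate also ranks in its top-k."""
    hits = topk_indices(full_scores, k) & topk_indices(pilot_scores, k)
    return len(hits) / k

# Made-up distributions over four tokens.
full = [0.5, 0.3, 0.1, 0.1]
good_pilot = [0.45, 0.35, 0.05, 0.15]
bad_pilot = [0.1, 0.5, 0.3, 0.1]
```

A pilot can have a noticeable L1 error and still reach full recall, as `good_pilot` does here, since selection depends only on the ranking.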

Circularity Check

0 steps flagged

No significant circularity in engineering co-design

full rationale

The paper presents shadowAttn as a practical system-algorithm co-design for NPU-centric sparse attention, relying on techniques such as NPU pilot compute for token selection, graph bucketing, head-wise pipelining, and per-head sparsity ratios. These are described as engineering choices validated through implementation and benchmarking rather than derived quantities. No equations, predictions, or first-principles results reduce by construction to fitted inputs or self-referential definitions. Claims of on-par accuracy and reduced CPU/GPU usage are framed as empirical outcomes, not tautological re-statements of the design itself. Self-citations, if present, are not load-bearing for any central mathematical result. The derivation chain is self-contained as an applied systems contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical systems contribution with no mathematical derivations, free parameters, or new physical entities; it relies on standard assumptions about attention sparsity and NPU hardware capabilities.

pith-pipeline@v0.9.0 · 5724 in / 1196 out tokens · 28303 ms · 2026-05-18T21:59:51.790801+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: shadowAttn offloads the estimation to NPU... only a small portion of tokens are computed on CPU/GPU with high precision float operations... per-head fine-grained sparsity ratio

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: NPU compute graph bucketing... scale factor buckets... step size 5e-1

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
cs.LG 2026-04 unverdicted novelty 6.0

NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...
EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices
cs.OS 2026-04 unverdicted novelty 6.0

EdgeFlow reduces mobile LLM cold-start latency up to 4.07x versus llama.cpp, MNN, and llm.npu by NPU-aware adaptive quantization, SIMD-friendly packing, and synergistic granular CPU-NPU pipelining at comparable accuracy.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · cited by 2 Pith papers · 18 internal anchors

[1] 2025. ARM NEON. https://www.arm.com/technologies/neon
[2] 2025. Hexagon NPU SDK. https://www.qualcomm.com/developer/software/hexagon-npu-sdk
[3] 2025. LLVM. https://llvm.org/
[4] 2025. Nvidia Jetson Orin. https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/
[5] 2025. Open CL. https://en.wikipedia.org/wiki/OpenCL
[6] 2025. QNN SDK. https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/introduction.html
[7] 2025. Qualcomm Neural Processing Engine. https://docs.qualcomm.com/bundle/publicresource/topics/80-70015-15BY/snpe.html
[8] 2025. rewind. https://www.rewind.ai/
[9] 2025. Snapdragon 8 gen 3 mobile platform product brief. https://docs.qualcomm.com/bundle/publicresource/87-71408-1_REV_C_Snapdragon_8_gen_3_Mobile_Platform_Product_Brief.pdf
[10] 2025. TMS320F2812 platform product brief. https://www.ti.com/product/TMS320F2812
[11] Marah Abdin and etc. Jyoti Aneja. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219 [cs.CL] https://arxiv.org/abs/2404.14219
[12] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...
[13] Ozan Baris, Yizhuo Chen, Gaofeng Dong, Liying Han, Tomoyoshi Kimura, Pengrui Quan, Ruijie Wang, Tianchen Wang, Tarek Abdelzaher, Mario Bergés, Paul Pu Liang, and Mani Srivastava. 2025. Foundation Models for CPS-IoT: Opportunities and Challenges. arXiv:2501.16368 [cs.LG] https://arxiv.org/abs/2501.16368
[14] Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov
[15] Small Language Models are the Future of Agentic AI. arXiv:2506.02153 [cs.AI] https://arxiv.org/abs/2506.02153
[16] Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. 2025. A Survey on Mixture of Experts in Large Language Models. IEEE Transactions on Knowledge and Data Engineering (2025), 1–20. doi:10.1109/tkde.2025.3554028
[17] Le Chen, Dahu Feng, Erhu Feng, Rong Zhao, Yingrui Wang, Yubin Xia, Haibo Chen, and Pinjie Xu. 2025. HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs platform with Heterogeneous AI Accelerators. arXiv:2501.14794 [cs.DC] https://arxiv.org/abs/2501.14794
[18] Wei Chen and Zhiyuan Li. 2024. Octopus v2: On-device language model for super agent. arXiv:2404.01744 [cs.CL] https://arxiv.org/abs/2404.01744
[19] Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2...
[20] Ziyan Fu, Ju Ren, Deyu Zhang, Yuezhi Zhou, and Yaoxue Zhang. 2022. Kalmia: A Heterogeneous QoS-aware Scheduling Framework for DNN Tasks on Edge Servers. In IEEE INFOCOM 2022 - IEEE Conference on Computer Communications. 780–789. doi:10.1109/INFOCOM48880.2022.9796661
[21] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. 2024. Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. arXiv:2310.01801 [cs.CL] https://arxiv.org/abs/2310.01801
[22] ggml. 2025. llama.cpp. https://github.com/ggml-org/llama.cpp
[23] Joo Seong Jeong, Jingyu Lee, Donghyun Kim, Changmin Jeon, Changjin Jeong, Youngki Lee, and Byung-Gon Chun. 2022. Band: coordinated multi-DNN inference on heterogeneous mobile processors. In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services (Portland, Oregon) (MobiSys ’22). Association for Computing Mach...
[24] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.068...
[25] Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention. arXiv:2407.02490 [cs.CL] https://arxiv.org/abs/2407.02490
[26] Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. 2025. FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference. arXiv:2502.20766 [cs.LG] https://arxiv.org/abs/2502.20766
[27] Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun Choi, Steven Y. Ko, Sangeun Oh, and Insik Shin. 2024. Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation. arXiv:2312.03003 [cs.HC] https://arxiv.org/abs/2312.03003
[28] Liang Li, Xingke Yang, Wen Wu, Hao Wang, Tomoaki Ohtsuki, Xin Fu, Miao Pan, and Xuemin Shen. 2025. MobiLLM: Enabling LLM Fine-Tuning on the Mobile Device via Server Assisted Side Tuning. arXiv:2502.20421 [cs.LG] https://arxiv.org/abs/2502.20421
[29] Xiang Li, Zhenyan Lu, Dongqi Cai, Xiao Ma, and Mengwei Xu
[30] Large Language Models on Mobile Devices: Measurements, Analysis, and Insights. In Proceedings of the Workshop on Edge and Mobile Foundation Models (Minato-ku, Tokyo, Japan) (EdgeFM ’24). Association for Computing Machinery, New York, NY, USA, 1–6. doi:10.1145/3662006.3662059
[31] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978 [cs.CL] https://arxiv.org/abs/2306.00978
[32] Kaiwei Liu, Bufang Yang, Lilin Xu, Yunqi Guo, Guoliang Xing, Xian Shuai, Xiaozhe Ren, Xin Jiang, and Zhenyu Yan. 2025. TaskSense: A Translation-like Approach for Tasking Heterogeneous Sensor Systems with LLMs. Association for Computing Machinery, New York, NY, USA, 213–225. https://doi.org/10.1145/3715014.3722070
[33] Mukul Lokhande, Gopal Raut, and Santosh Kumar Vishvakarma. 2024. Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads. arXiv:2412.11702 [cs.AR] https://arxiv.org/abs/2412.11702
[34] Mukul Lokhande and Santosh Kumar Vishvakarma. 2025. POLARON: Precision-aware On-device Learning and Adaptive Runtime-cONfigurable AI acceleration. arXiv:2506.08785 [cs.AR] https://arxiv.org/abs/2506.08785
[35] Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, and Jiezhong Qiu. 2025. MoBA: Mixture of Block Attention for Long...
[36] Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei
[37] The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv:2402.17764 [cs.CL] https://arxiv.org/abs/2402.17764
[38] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher
[39] Pointer Sentinel Mixture Models. arXiv:1609.07843 [cs.CL]
[40] Melkamu Mersha, Khang Lam, Joseph Wood, Ali K. AlShami, and Jugal Kalita. 2024. Explainable artificial intelligence: A survey of needs, techniques, applications, and future direction. Neurocomputing 599 (Sept. 2024), 128111. doi:10.1016/j.neucom.2024.128111
[41] MLC team. 2023-2025. MLC-LLM. https://github.com/mlc-ai/mlc-llm
[42] Mozhgan Navardi, Romina Aalishah, Yuzhe Fu, Yueqian Lin, Hai Li, Yiran Chen, and Tinoosh Mohsenin. 2025. GenAI at the Edge: Comprehensive Survey on Empowering Edge Devices. arXiv:2502.15816 [cs.DC] https://arxiv.org/abs/2502.15816
[43] Xiaomin Ouyang, Xian Shuai, Yang Li, Li Pan, Xifan Zhang, Heming Fu, Sitong Cheng, Xinyan Wang, Shihua Cao, Jiang Xin, Hazel Mok, Zhenyu Yan, Doris Sau Fung Yu, Timothy Kwok, and Guoliang Xing. 2024. ADMarker: A Multi-Modal Federated Learning System for Monitoring Digital Biomarkers of Alzheimer’s Disease. arXiv:2310.15301 [cs.LG] https://arxiv.org/ab...
[44] Jun-Seok Park, Changsoo Park, Suknam Kwon, Taeho Jeon, Yesung Kang, Heonsoo Lee, Dongwoo Lee, James Kim, Hyeong-Seok Kim, YoungJong Lee, Sangkyu Park, MinSeong Kim, SangHyuck Ha, Jihoon Bang, Jinpyo Park, SukHwan Lim, and Inyup Kang. 2023. A Multi-Mode 8k-MAC HW-Utilization-Aware Neural Processing Unit With a Unified Multi-Precision Datapath in 4-nm Flag...
[45] Dan Peng, Zhihui Fu, and Jun Wang. 2024. PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs. arXiv:2407.01031 [cs.LG] https://arxiv.org/abs/2407.01031
[46]

phonelm. 2025. PhoneLM-0.5B. https://huggingface.co/unsloth/ PhoneLM-0.5B

work page 2025
[47]

phonelm. 2025. PhoneLM-1.5B. https://huggingface.co/unsloth/ PhoneLM-1.5B

work page 2025
[48]

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

qwen. 2025. Qwen2-0.5B. https://huggingface.co/unsloth/Qwen2-0.5B

work page 2025
[50]

qwen. 2025. Qwen2-1.5B. https://huggingface.co/unsloth/Qwen2-1.5B

work page 2025
[51]

redmi. 2025. Redmi K60 Champion Edition Smartphone . https://www. gsmarena.com/xiaomi_redmi_k60_pro-12046.php

work page 2025
[52]

Tanmoy Sen, Haiying Shen, and Anand Padmanabha Iyer. 2025. Flex: Fast, Accurate DNN Inference on Low-Cost Edges Using Heteroge- neous Accelerator Execution. In Proceedings of the Twentieth Euro- pean Conference on Computer Systems (Rotterdam, Netherlands) (Eu- roSys ’25). Association for Computing Machinery, New York, NY, USA, 507–523. doi: 10.1145/368903...

work page doi:10.1145/3689031.3696067 2025
[53]

Andrii Skliar, Ties van Rozendaal, Romain Lepert, Todor Boinovski, Mart van Baalen, Markus Nagel, Paul Whatmough, and Babak Eht- eshami Bejnordi. 2025. Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference. arXiv:2412.00099 [cs.LG] https: //arxiv.org/abs/2412.00099

work page arXiv 2025
[54]

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2023. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864 [cs.CL] https://arxiv.org/abs/ 2104.09864

work page internal anchor Pith review Pith/arXiv arXiv 2023
[55]

Shreyas Subramanian, Vikram Elango, and Mecit Gungor. 2025. Small Language Models (SLMs) Can Still Pack a Punch: A survey. arXiv:2501.05465 [cs.CL] https://arxiv.org/abs/2501.05465

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. 2025. Towards End-to-End Optimization of LLM-based Applications with Ayo. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Rotterdam, Netherlands) (ASPLOS ’25). Association for Computing Machinery, New York, NY, USA, 13...

[57]

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. arXiv:2406.10774 [cs.CL] https://arxiv.org/abs/2406.10774

[58]

TFLite team. 2025. mediapipe. https://ai.google.dev/edge/mediapipe/solutions/guide

[59]

Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, and Fahad Shahbaz Khan. 2024. MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT. arXiv:2402.16840 [cs.CL] https://arxiv.org/abs/2402.16840

[60]

Hanrui Wang, Zhekai Zhang, and Song Han. 2021. SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. HPCA (2021)

[61]

Jianyu Wei, Ting Cao, Shijie Cao, Shiqi Jiang, Shaowei Fu, Mao Yang, Yanyong Zhang, and Yunxin Liu. 2023. NN-Stretch: Automatic Neural Network Branching for Parallel Inference on Heterogeneous Multi-Processors. In Proceedings of the 21st Annual International Conference on Mobile Systems, Applications and Services (Helsinki, Finland) (MobiSys ’23). Asso...

[62]

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2024. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. arXiv:2211.10438 [cs.CL] https://arxiv.org/abs/2211.10438

[63]

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. 2024. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. arXiv:2410.10819 [cs.CL] https://arxiv.org/abs/2410.10819

[64]

xiaomi. 2025. MI14 Smartphone. https://www.mi.com/global/product/xiaomi-14/specs/

[65]

Weikai Xie, Li Zhang, Shihe Wang, Rongjie Yi, and Mengwei Xu. 2024. DroidCall: A Dataset for LLM-powered Android Intent Invocation. arXiv:2412.00402 [cs.AI] https://arxiv.org/abs/2412.00402

[66]

Daliang Xu, Wangsong Yin, Hao Zhang, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu. 2025. EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding. IEEE Transactions on Mobile Computing 24, 4 (2025), 3256–3273. doi: 10.1109/TMC.2024.3513457

[67]

Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, and Xuanzhe Liu. 2025. Fast On-device LLM Inference with NPUs. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (Rotterdam, Netherlands) (ASPLOS ’25). Association for Computing Machinery, Ne...

[68]

Mengwei Xu, Dongqi Cai, Wangsong Yin, Shangguang Wang, Xin Jin, and Xuanzhe Liu. 2025. Resource-efficient Algorithms and Systems of Foundation Models: A Survey. ACM Comput. Surv. 57, 5, Article 110 (Jan. 2025), 39 pages. doi: 10.1145/3706418

[69]

Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, and Xuanzhe Liu. 2024. A Survey of Resource-efficient LLM and Multimodal Foundation Models. arXiv:2401.08092 [cs.LG] https://arxiv.org/abs/2401.08092

[70]

Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. 2025. XAttention: Block Sparse Attention with Antidiagonal Scoring. arXiv:2503.16428 [cs.CL] https://arxiv.org/abs/2503.16428

[71]

Zhenliang Xue, Yixin Song, Zeyu Mi, Xinrui Zheng, Yubin Xia, and Haibo Chen. 2024. PowerInfer-2: Fast Large Language Model Inference on a Smartphone. arXiv:2406.06282 [cs.LG] https://arxiv.org/abs/2406.06282

[72]

Bufang Yang, Lilin Xu, Liekang Zeng, Kaiwei Liu, Siyang Jiang, Wenrui Lu, Hongkai Chen, Xiaofan Jiang, Guoliang Xing, and Zhenyu Yan. ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions. arXiv:2505.14668 [cs.AI] https://arxiv.org/abs/2505.14668
[74]

Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, and Song Han. 2025. LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention. arXiv:2502.14866 [cs.CL] https://arxiv.org/abs/2502.14866

[75]

Juheon Yi and Youngki Lee. 2020. Heimdall: mobile GPU coordination platform for augmented reality applications. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking (London, United Kingdom) (MobiCom ’20). Association for Computing Machinery, New York, NY, USA, Article 35, 14 pages. doi: 10.1145/3372224.3419192

[76]

Wangsong Yin, Daliang Xu, Gang Huang, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu. 2024. PieBridge: Fast and Parameter-Efficient On-Device Training via Proxy Networks. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (Hangzhou, China) (SenSys ’24). Association for Computing Machinery, New York, NY, USA, 126–140. doi:...

[77]

Wangsong Yin, Mengwei Xu, Yuanchun Li, and Xuanzhe Liu. 2024. LLM as a System Service on Mobile Devices. arXiv:2403.11805 [cs.OS] https://arxiv.org/abs/2403.11805

[78]

Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, and Xuanzhe Liu. 2024. ELMS: Elasticized Large Language Models On Mobile Devices. arXiv:2409.09071 [cs.DC] https://arxiv.org/abs/2409.09071

[79]

Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, and Wangding Zeng. 2025. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. arXiv:2502.11089 [cs.CL] https://arxiv.org/abs/2502.11089

[80]

Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. 2025. Spargeattn: Accurate sparse attention accelerating any model inference. In International Conference on Machine Learning (ICML)
