Pith · machine review for the scientific record

arxiv: 2604.18788 · v1 · submitted 2026-04-20 · 💻 cs.LG

Recognition: unknown

Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:27 UTC · model grok-4.3

classification 💻 cs.LG
keywords mixture of experts · LLM inference · Apple Silicon · neural processing unit · NPU offloading · inference optimization · long-context workloads · energy-efficient inference

The pith

NPUMoE lets Mixture-of-Experts LLMs offload most work to Apple NPUs using offline expert calibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces NPUMoE, a runtime engine that runs Mixture-of-Experts large language models on the Neural Processing Units inside Apple Silicon chips. MoE models activate only a few experts per token, which creates changing tensor shapes and irregular operations that NPUs cannot handle directly. NPUMoE addresses this by running offline calibration on sample data to predict which experts will be used most often, then grouping those experts into static tiers and keeping their computation graphs loaded on the NPU; dynamic routing and irregular steps fall back to the CPU or GPU. A reader would care because this makes long-context inference noticeably faster and more energy-efficient on existing Apple hardware without redesigning the chips.
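
The calibration-then-freeze idea is easiest to picture as frequency counting over routing decisions. Below is a minimal sketch in Python, assuming the router's top-k expert indices can be logged per layer during a calibration pass over sample prompts; the function and variable names are illustrative, not taken from the paper.

```python
from collections import Counter
from typing import Dict, List

def calibrate_expert_popularity(
    routing_traces: List[List[List[int]]],  # [sample][layer] -> expert ids selected for that sample
    num_layers: int,
) -> Dict[int, List[int]]:
    """Count how often each expert is selected per layer during calibration
    and return a per-layer popularity ranking (most- to least-selected)."""
    counts = [Counter() for _ in range(num_layers)]
    for sample in routing_traces:
        for layer_idx, expert_ids in enumerate(sample):
            counts[layer_idx].update(expert_ids)
    # Rank experts within each layer by selection frequency, descending.
    return {
        layer_idx: [expert for expert, _ in counts[layer_idx].most_common()]
        for layer_idx in range(num_layers)
    }
```

Per the paper's own description (see the Figure 10 excerpt below), this ranking then configures two runtime decisions: capacity-tier assignment and expert compute graph residency.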

Core claim

NPUMoE is a runtime inference engine that accelerates MoE execution on Apple Silicon by offloading dense, static computation to the NPU while preserving a CPU/GPU fallback path for dynamic operations. NPUMoE uses offline calibration to estimate expert capacity and popularity, and these estimates drive three key techniques: (1) static tiers for expert capacity to address dynamic expert routing; (2) grouped expert execution to mitigate NPU concurrency limits; and (3) load-aware expert compute graph residency to reduce CPU-NPU synchronization overhead. Experiments on Apple M-series devices using three representative MoE LLMs and four long-context workloads show that NPUMoE consistently outperforms baselines, reducing latency by 1.32x-5.55x, improving energy efficiency by 1.81x-7.37x, and reducing CPU-cycle usage by 1.78x-5.54x through effective NPU offloading.
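
One way to read "static tiers for expert capacity" is that each expert gets a fixed, pre-compiled token capacity drawn from a small menu of sizes, chosen from the calibration popularity ranking so the NPU graphs never see a dynamic shape. A hedged sketch follows; the tier sizes and the rank-bucketing rule are assumptions, not the authors' published policy.

```python
def assign_capacity_tiers(
    popularity_ranking: list[int],                 # expert ids, most popular first (from calibration)
    tier_sizes: tuple[int, ...] = (256, 64, 16),   # assumed menu of static token capacities
) -> dict[int, int]:
    """Map each expert to one of a few fixed capacities so its NPU compute
    graph can be compiled once with a static input shape. Popular experts
    get larger tiers; overflow beyond a tier would take the CPU/GPU
    fallback path."""
    n = len(popularity_ranking)
    tiers = {}
    for rank, expert_id in enumerate(popularity_ranking):
        # Split the ranking into equal buckets, one bucket per tier size.
        bucket = min(rank * len(tier_sizes) // max(n, 1), len(tier_sizes) - 1)
        tiers[expert_id] = tier_sizes[bucket]
    return tiers
```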

What carries the argument

Offline calibration that produces static expert capacity tiers and popularity estimates, which then drive grouped NPU execution and load-aware graph residency.
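
The same popularity signal plausibly drives the other two techniques: fuse several experts into one compute-graph launch to amortize dispatch, and keep only the hottest groups resident on the NPU while cold experts take the CPU/GPU fallback. A minimal sketch under assumed parameters (`group_size`, `npu_resident_groups`); the paper's actual grouping and residency policies are not specified at this level of detail in the review.

```python
def plan_groups_and_residency(
    popularity_ranking: list[int],   # expert ids, most popular first
    group_size: int = 4,             # assumed number of experts fused per NPU launch
    npu_resident_groups: int = 2,    # assumed number of groups kept loaded on the NPU
) -> tuple[list[list[int]], dict[int, str]]:
    """Group experts by popularity rank and mark the hottest groups as
    NPU-resident; all remaining experts route to the CPU/GPU fallback."""
    groups = [
        popularity_ranking[i:i + group_size]
        for i in range(0, len(popularity_ranking), group_size)
    ]
    placement = {}
    for group_idx, group in enumerate(groups):
        device = "npu" if group_idx < npu_resident_groups else "cpu_gpu_fallback"
        for expert_id in group:
            placement[expert_id] = device
    return groups, placement
```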

If this is right

  • Most of the expert matrix multiplications in MoE models can be moved to the NPU even though routing is dynamic.
  • Long-context prefills on Apple M-series devices become 1.32x to 5.55x faster.
  • Energy used per token drops by 1.81x to 7.37x when the NPU handles the grouped static work.
  • CPU cycles spent on inference fall by 1.78x to 5.54x, freeing the processor for other tasks.
  • The same calibration and grouping approach works across different MoE models and workloads tested.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same offline calibration pattern could be tried on other mobile NPUs if their shape and concurrency constraints are similar.
  • Device makers might add more flexible support for scatter/gather and top-k inside future NPUs once software shows the payoff.
  • On-device long-context applications such as document summarization could become practical on phones and laptops that already contain Apple Silicon.
  • If calibration data is refreshed periodically from recent user sessions, the method might stay effective without full retraining.

Load-bearing premise

That offline calibration on representative data can produce static expert capacity tiers and popularity estimates that remain accurate enough at runtime to avoid frequent fallback or performance collapse on unseen long-context workloads.

What would settle it

Running the system on a long-context workload whose expert routing statistics differ sharply from the calibration data and checking whether latency, energy, and CPU gains disappear or reverse.
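
A concrete version of that test: compare the expert-selection distribution of an out-of-distribution long-context workload against the calibration distribution, and track how often tokens land on experts whose graphs calibration did not keep on the NPU. The sketch below is one plausible instrumentation, not the paper's evaluation code; the divergence metric and the fallback definition are assumptions.

```python
import math

def routing_shift_and_fallback_rate(
    calib_counts: dict[int, int],     # expert id -> selection count during calibration
    runtime_counts: dict[int, int],   # expert id -> selection count on the new workload
    npu_resident: set[int],           # experts whose compute graphs are kept on the NPU
) -> tuple[float, float]:
    """Return (KL divergence of runtime routing vs. calibration routing,
    fraction of runtime selections that missed the NPU-resident set)."""
    experts = set(calib_counts) | set(runtime_counts)
    calib_total = sum(calib_counts.values()) or 1
    run_total = sum(runtime_counts.values()) or 1
    eps = 1e-9  # smoothing so unseen experts do not blow up the log
    kl = 0.0
    for e in experts:
        p = runtime_counts.get(e, 0) / run_total + eps
        q = calib_counts.get(e, 0) / calib_total + eps
        kl += p * math.log(p / q)
    fallback_rate = sum(
        count for e, count in runtime_counts.items() if e not in npu_resident
    ) / run_total
    return kl, fallback_rate
```

If the divergence rises sharply while the latency and energy gains hold and the fallback rate stays low, the load-bearing premise survives; if the gains track the divergence downward, it does not.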

Figures

Figures reproduced from arXiv: 2604.18788 by Afsara Benazir and Felix Xiaozhu Lin.

Figure 1
Figure 1: Our system: NPUMoE relative to CPUs/GPUs. Prior work primarily focuses on efficient execution of dense transformer models on mobile GPUs [9] and NPUs [11, 12, 43]. In contrast, MoE LLMs such as PhiMoE [1] and Qwen3 MoE [47] exhibit dynamic, sparse execution patterns that are fundamentally misaligned with NPU execution pipelines. In addition to executing a dense attention block, in MoE architectures, a ro… view at source ↗
Figure 2
Figure 2: Prefill phase dominates end-to-end inference. …NPUMoE therefore maintains a small resident working set of expert compute graphs on NPU, formed by hot expert groups whose launches can be amortized, while relegating cold experts to the CPU/GPU fallback path. Expert popularity (§4.3) determines the hot/cold state of each expert. Implementation and Evaluation (§5): We implement NPUMoE from scratch, usi… view at source ↗
Figure 4
Figure 4: Runtime operations supported by ANE and CPU. …expert FFN compute, gathers expert output, and attention, and each expert must handle variable input tokens. The Apple Neural Engine is Apple Silicon's dedicated NPU for energy-efficient, high-throughput neural network inference. Its unified-memory architecture allows the CPU, GPU, and ANE to access the same memory pool, making the platform attractive for on-device … view at source ↗
Figure 8
Figure 8: CPU-NPU synchronization overhead. The benefit appears only when enough work is amortized per launch and graph compute isn't too small (M=64) or too large (M=32048, or vocabulary size). CPU-NPU coordination pays the cost of high synchronization [14]. Unified memory helps communication, but it does not eliminate coordination cost; several recent systems show that synchronization and launch overheads can be comparab… view at source ↗
Figure 6
Figure 6: A compute graph of a single MoE router. Dense MatMul runtimes (CPU only vs. CPU+NPU):

  Shape (M, N, K)         CPU only (ms)   CPU+NPU (ms)   NPU speedup
  (64, 4096, 4096)        1.27            1.24           1.03x
  (256, 4096, 4096)       3.80            2.67           1.35x
  (512, 4096, 4096)       7.33            3.80           1.94x
  (1024, 4096, 4096)      15.20           7.22           2.12x
  (2048, 4096, 4096)      30.78           13.15          2.34x
  (32048, 4096, 14336)    3322.26         3201.00        1.04x

view at source ↗
Figure 7
Figure 7: Runtime of dense matmuls on M2 Ultra demonstrates NPU efficiency. Benefit is reduced for very small (M=64) or very large (M=32048) matrices. …higher than execution time, making runtime dynamic graph construction on NPU impractical. Each compute graph is unique; components with similar shape but different weights cannot trivially share a single graph [21]. Scenario: Two deployment scenarios exist: (1) Avail… view at source ↗
Figure 9
Figure 9: Our system NPUMoE incorporates key designs: (1) static tiers for expert capacity (§4.1), (2) grouped expert execution (§4.2), and (3) load-aware compute graph residency (§4.3). GPU resources remain available for background tasks while CPU and NPU synchronize between runtime operations. …and (2) optimize for resource efficiency. In particular, we seek to achieve target latency goals, improve energy efficiency… view at source ↗
Figure 10
Figure 10: Static tiers for expert capacity (§4.1). 3.3 Offline Calibration: We perform an offline calibration pass that measures per-layer expert routing statistics and derives an expert popularity ranking. NPUMoE then uses this ranking to configure two runtime decisions: capacity-tier assignment (§4.1) and expert compute graph residency (§4.3). For each layer, we record how often each expert is selected and rank e… view at source ↗
Figure 11
Figure 11: Grouped expert execution (§4.2). 4.2 Grouped Expert Execution: Following §4.1, a naive, straightforward approach is to allocate a separate compute graph for each expert after estimating its capacity, but this incurs non-trivial dispatch overhead. Issue: fine-grained per-expert execution launches many small graphs at runtime. Since these launches typically share a single device queue, they are serviced sequ… view at source ↗
Figure 12
Figure 12: Expert placement based on expert popularity on different compute units (§4.3). E0 means expert 0, and so on. …token position multiplied by their corresponding routing weights. This layout eliminates dynamic indexing inside the compiled graph: expert-to-token assignment is resolved before invocation, and the model only sees a dense, statically shaped tensor. For example, if four experts receive (64, 32, 32,… view at source ↗
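
Figure 12's layout rule, read literally, amounts to gathering each expert's tokens into a fixed-capacity, zero-padded buffer on the host before the compiled graph is invoked, so the NPU only ever sees statically shaped inputs. A sketch of that packing under assumed shapes; the function name, capacity handling, and overflow behavior are illustrative, not the authors' implementation.

```python
import numpy as np

def pack_tokens_statically(
    hidden_states: np.ndarray,        # (num_tokens, hidden_dim) activations for one MoE layer
    expert_assignment: list[int],     # per-token expert id, resolved on the CPU before invocation
    capacities: dict[int, int],       # expert id -> static capacity from the tier assignment
) -> dict[int, np.ndarray]:
    """Gather each expert's tokens into a fixed-capacity buffer, padding
    with zeros and leaving overflow tokens to the fallback path, so every
    NPU graph input has shape (capacity, hidden_dim)."""
    hidden_dim = hidden_states.shape[1]
    packed = {}
    for expert_id, capacity in capacities.items():
        buffer = np.zeros((capacity, hidden_dim), dtype=hidden_states.dtype)
        token_ids = [i for i, e in enumerate(expert_assignment) if e == expert_id][:capacity]
        if token_ids:
            buffer[: len(token_ids)] = hidden_states[token_ids]
        packed[expert_id] = buffer
    return packed
```
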
Figure 13
Figure 13: Trade-off analysis of latency, energy consumption, and CPU cycle usage for PhiMoE. NPUMoE achieves the optimal Pareto frontier compared to all baselines. …FP16 before computation [37]. Unless otherwise noted, all comparisons use the same model checkpoint, tokenizer, FP16 precision, and real routing traces from the workload. Offline Calibration: We profile and evaluate on separate randomized subsets of our test… view at source ↗
Figure 15
Figure 15: Energy consumption breakdown per token across varying prefill chunk sizes (C) and prompt lengths (P) for PhiMoE on the M2 Ultra. NPUMoE consistently demonstrates the highest energy efficiency across all configurations. Panels compare Ours, ANEMLL, CoreML (naïve), and CoreML (CPU) across settings including prefill chunk = 512, prompt len. = 1024 and prefill chunk = 256, prompt len. = 1024. … view at source ↗
Figure 16
Figure 16: Total CPU cycle consumption across prefill workloads for PhiMoE on M2 Ultra (lower is better). Reduced usage indicates effective offloading of compute to the NPU. Our gains persist across both short and long prompts and remain robust as chunk size increases. For short prompts (length=1024), the largest advantage appears at chunk size 512, where CoreML (naïve), CoreML (CPU), and ANEMLL consume 7.37x, 3.19x, a… view at source ↗
Figure 17
Figure 17: Impact of our key techniques. …(3) Ours-TG: Ours-T with grouped expert execution; and (4) Ours-all: Ours-TG with load-aware expert compute residency. Our three techniques make a significant contribution to the overall improvement. view at source ↗
Figure 19
Figure 19: Runtime latency and energy efficiency of PhiMoE-tiny running on M2 Max. view at source ↗
read the original abstract

Apple Neural Engine (ANE) is a dedicated neural processing unit (NPU) present in every Apple Silicon chip. Mixture-of-Experts (MoE) LLMs improve inference efficiency via sparse activation but are challenging for NPUs in three ways: expert routing is unpredictable and introduces dynamic tensor shapes that conflict with the shape-specific constraints of NPUs; several irregular operators, e.g., top-k, scatter/gather, etc., are not NPU-friendly; and launching many small expert kernels incurs substantial dispatch and synchronization overhead. NPUs are designed to offload AI compute from CPU and GPU; our goal is to enable such offloading for MoE inference, particularly during prefill, where long-context workloads consume substantial system resources. This paper presents NPUMoE, a runtime inference engine that accelerates MoE execution on Apple Silicon by offloading dense, static computation to NPU, while preserving a CPU/GPU fallback path for dynamic operations. NPUMoE uses offline calibration to estimate expert capacity and popularity that drives three key techniques: (1) Static tiers for expert capacity to address dynamic expert routing (2) Grouped expert execution to mitigate NPU concurrency limits (3) Load-aware expert compute graph residency to reduce CPU-NPU synchronization overhead. Experiments on Apple M-series devices using three representative MoE LLMs and four long-context workloads show that NPUMoE consistently outperforms baselines, reducing latency by 1.32x-5.55x, improving energy efficiency by 1.81x-7.37x, and reducing CPU-cycle usage by 1.78x-5.54x through effective NPU offloading.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces NPUMoE, a runtime inference engine for Mixture-of-Experts LLMs on Apple Silicon NPUs. It addresses NPU limitations with MoE (unpredictable routing, irregular operators, dispatch overhead) by using offline calibration to derive static expert capacity tiers and popularity estimates. These drive three techniques: static capacity tiers, grouped expert execution, and load-aware compute graph residency, enabling NPU offloading of dense static computation with CPU/GPU fallback for dynamic parts. Experiments on M-series devices with three MoE LLMs and four long-context workloads report latency reductions of 1.32x-5.55x, energy efficiency gains of 1.81x-7.37x, and CPU-cycle reductions of 1.78x-5.54x.

Significance. If the empirical claims hold under scrutiny, this work would be significant for practical deployment of sparse MoE models on consumer NPUs, particularly for long-context prefill workloads that stress system resources. It offers concrete engineering solutions to make NPU offloading viable for dynamic routing patterns, which could improve the accessibility and efficiency of large LLMs on Apple hardware. The paper supports its contributions with specific quantitative gains on representative models and workloads, though the absence of detailed validation limits immediate impact assessment.

major comments (2)
  1. [Offline calibration and techniques description] The offline calibration of static expert capacity tiers and popularity estimates (described as driving all three core techniques) is load-bearing for the claimed speedups, yet no ablation is provided on calibration-test distribution mismatch or measured fallback frequency to CPU/GPU paths when long-context inputs shift activation patterns from the calibration data. This directly affects whether the 1.32x-5.55x latency and 1.81x-7.37x energy gains generalize beyond the four evaluated workloads.
  2. [Experiments] The Experiments section (and abstract) reports specific performance numbers (latency, energy, CPU cycles) across three MoE LLMs and four workloads but supplies no baseline descriptions, workload definitions, error bars, hardware configurations, or statistical details, preventing verification of the central empirical claim that NPUMoE consistently outperforms via effective NPU offloading.
minor comments (1)
  1. [Introduction] The abstract and introduction could more clearly distinguish the NPU-specific constraints (shape-specific kernels, concurrency limits) from general MoE challenges to sharpen the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that additional analysis on calibration robustness and complete experimental details are needed to strengthen the paper. We will revise accordingly.

read point-by-point responses
  1. Referee: [Offline calibration and techniques description] The offline calibration of static expert capacity tiers and popularity estimates (described as driving all three core techniques) is load-bearing for the claimed speedups, yet no ablation is provided on calibration-test distribution mismatch or measured fallback frequency to CPU/GPU paths when long-context inputs shift activation patterns from the calibration data. This directly affects whether the 1.32x-5.55x latency and 1.81x-7.37x energy gains generalize beyond the four evaluated workloads.

    Authors: We agree that demonstrating robustness to distribution shift is important. In the revised manuscript we will add an ablation evaluating NPUMoE on additional long-context inputs drawn from distributions distinct from the calibration set (e.g., different prompt styles and context lengths). We will also report the observed fallback frequency to CPU/GPU paths on the four evaluated workloads, confirming that static tiers keep fallback low. revision: yes

  2. Referee: [Experiments] The Experiments section (and abstract) reports specific performance numbers (latency, energy, CPU cycles) across three MoE LLMs and four workloads but supplies no baseline descriptions, workload definitions, error bars, hardware configurations, or statistical details, preventing verification of the central empirical claim that NPUMoE consistently outperforms via effective NPU offloading.

    Authors: We acknowledge the omission of these details. The revised Experiments section will include: explicit descriptions of all baselines, precise definitions and input characteristics of the four workloads, hardware specifications for the M-series devices, error bars with standard deviations from repeated runs, and any relevant statistical information. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems paper with measured results

full rationale

The manuscript describes an engineering runtime system (NPUMoE) that uses offline calibration to set static expert tiers and popularity estimates, then applies three techniques (static capacity tiers, grouped execution, load-aware graph residency) and reports measured speedups on four workloads against baselines. No equations, first-principles derivations, or fitted predictions are presented as outputs that reduce to the calibration inputs by construction. No self-citation chains or uniqueness theorems are invoked to justify core claims. The contribution is therefore self-contained empirical evaluation rather than a closed derivation loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. Offline calibration is implied to produce fitted popularity and capacity values, but their exact form and fitting procedure are unavailable.

pith-pipeline@v0.9.0 · 5603 in / 1248 out tokens · 46727 ms · 2026-05-10T05:27:57.006915+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 22 canonical work pages · 10 internal anchors

  1. [1]

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. 2024. Phi-4 technical report. arXiv preprint arXiv:2412.08905 (2024)

  2. [2]

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. 2023. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369 (2023)

  3. [3]

    Osayamen Jonathan Aimuyo, Byungsoo Oh, and Rachee Singh. 2025. FlashMoE: Fast Distributed MoE in a Single Kernel. arXiv preprint arXiv:2506.04667 (2025)

  4. [4]

    Yash Akhauri, Ahmed F AbouElhamayed, Jordan Dotzel, Zhiru Zhang, Alexander M Rush, Safeen Huda, and Mohamed S Abdelfattah. 2024. Shadowllm: Predictor-based contextual sparsity for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 19154–19167

  5. [5]

    ANEMLL Contributors. 2026. ANEMLL: Artificial Neural Engine Machine Learning Library. https://github.com/Anemll/Anemll. GitHub repository, version 0.3.5 beta, accessed 2026-04-07

  6. [6]

    antmikinka. 2024. Optimization Guidelines for the Apple Neural Engine (ANE). https://gist.github.com/antmikinka/715499ae63630575065b22e5cb6ad8dd. GitHub Gist. Accessed: 2026-04-15

  7. [7]

    Apple Inc. 2023. Apple Reports First Quarter Results. Corporate Report. Apple Inc. https://www.apple.com/newsroom/2023/02/apple-reports-first-quarter-results/

  8. [8]

    Apple Machine Learning Research. 2024. Deploying Attention-Based Vision Transformers to Apple Neural Engine. Apple Machine Learning Research (5 Jan 2024). https://machinelearning.apple.com/research/vision-transformers

  9. [9]

    Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E Gonzalez, Matei Zaharia, and Ion Stoica. 2025. Moe-lightning: High-throughput moe inference on memory-constrained gpus. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 715–730

  10. [10]

    Le Chen, Dahu Feng, Erhu Feng, Yingrui Wang, Rong Zhao, Yubin Xia, Pinjie Xu, and Haibo Chen. 2025. Characterizing mobile soc for accelerating heterogeneous llm inference. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 359–374

  11. [11]

    Le Chen, Dahu Feng, Erhu Feng, Rong Zhao, Yingrui Wang, Yubin Xia, Haibo Chen, and Pinjie Xu. 2025. Heterollm: Accelerating large language model inference on mobile socs platform with heterogeneous ai accelerators. arXiv e-prints (2025), arXiv–2501

  12. [12]

    Zhiyang Chen, Daliang Xu, Haiyang Shen, Chiheng Lou, Mengwei Xu, Shangguang Wang, Xin Jin, and Yun Ma. 2025. Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution. arXiv preprint arXiv:2510.15312 (2025)

  13. [13]

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short pap...

  14. [14]

    Yuntao Dai, Jing Wu, Hang Gu, and Teng Wang. 2026. Accelerating OpenPangu Inference on NPU via Speculative Decoding. arXiv preprint arXiv:2603.03383 (2026)

  15. [15]

    Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xiaoxuan Liu, Yifan Qiao, et al

  16. [16]

    Prefillonly: An inference engine for prefill-only workloads in large language model applications. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 399–414

  17. [17]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39

  18. [18]

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2024. The L...

  19. [19]

    Zixu Hao, Jianyu Wei, Tuowei Wang, Minxing Huang, Huiqiang Jiang, Shiqi Jiang, Ting Cao, and Ju Ren. 2025. Scaling llm test-time compute with mobile npu on smartphones. arXiv preprint arXiv:2509.23324 (2025)

  20. [20]

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654 (2024)

  21. [21]

    Paul Hübner, Andong Hu, Ivy Peng, and Stefano Markidis. 2025. Apple vs. oranges: Evaluating the apple silicon m-series socs for hpc performance and efficiency. In 2025 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 45–54

  22. [22]

    Apple Inc. 2026. Core ML. https://developer.apple.com/machine-learning/core-ml/. Accessed: 2026-03-02

  23. [23]

    Apple Inc. 2026. Core ML Models. https://developer.apple.com/machine-learning/models/. Accessed: 2026-03-02

  24. [24]

    Apple Inc. 2026. Deploying Transformers on the Apple Neural Engine. https://machinelearning.apple.com/research/neural-engine-transformers. Accessed: 2026-03-02

  25. [25]

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. 2024. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. Advances in Neural Information Processing Systems 37 (2024), 52481–52515

  26. [26]

    Jaehoon Jung, Jinpyo Kim, and Jaejin Lee. 2023. Deepum: Tensor migration and prefetching in unified memory. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 207–221

  27. [27]

    Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, and Baris Kasikci

  28. [28]

    Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models. In The Thirteenth International Conference on Learning Representations (ICLR)

  29. [29]

    Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, and Yunxin Liu. 2024. SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)

  30. [30]

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020)

  31. [31]

    Minchong Li, Feng Zhou, and Xiaohui Song. 2025. Bild: Bi-directional logits difference loss for large language model distillation. In Proceedings of the 31st International Conference on Computational Linguistics. 1168–1182

  32. [32]

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434 (2024)

  33. [33]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)

  34. [34]

    Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. 2023. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning. PMLR, 22137–22176

  35. [35]

    Microsoft. 2024. Phi-3.5-MoE-instruct. https://huggingface.co/microsoft/Phi-3.5-MoE-instruct. Hugging Face model card. Accessed: 2026-04-09

  36. [36]

    Microsoft. 2025. Phi-tiny-MoE-instruct. https://huggingface.co/microsoft/Phi-tiny-MoE-instruct. Hugging Face model card. Accessed: 2026-04-09

  37. [37]

    Seungjae Moon, Junseo Cha, Hyunjun Park, and Joo-Young Kim

  38. [38]

    Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window. In Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA). 808–820

  39. [39]

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)

  40. [40]

    Manjeet Singh. 2026. Inside the M4 Apple Neural Engine. https://maderix.substack.com/p/inside-the-m4-apple-neural-engine-615. maderix’s Substack

  41. [41]

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al

  42. [42]

    Kimi-VL technical report. arXiv preprint arXiv:2504.07491 (2025)

  43. [43]

    The ML-Energy Initiative. 2025. zeus-apple-silicon: Zeus device support for Apple Silicon. https://github.com/ml-energy/zeus-apple-silicon. GitHub repository

  44. [44]

    Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Haoning Guan, Rui Qu, Maoliang Li, Xiang Chen, and Guojie Luo. 2025. Agent.xpu: Efficient scheduling of agentic llm workloads on heterogeneous soc. arXiv preprint arXiv:2506.24045 (2025)

  45. [45]

    Wikipedia contributors. 2026. Apple Neural Engine. https://en.wikipedia.org/wiki/Neural_Engine. Accessed: 2026-03-02

  46. [46]

    Ao Xiao, Bangzheng He, Baoquan Zhang, Baoxing Huai, Bingji Wang, Bo Wang, Bo Xu, Boyi Hou, Chan Yang, Changhong Liu, et al. 2025. xdeepserve: Model-as-a-service on huawei cloudmatrix384. arXiv preprint arXiv:2508.02520 (2025)

  47. [47]

    Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, and Xuanzhe Liu. 2025. Fast on-device llm inference with npus. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 445–462

  48. [48]

    Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. 2024. Moe-infinity: Efficient moe inference on personal machines with sparsity-aware expert cache. arXiv preprint arXiv:2401.14361 (2024)

  49. [49]

    Yuqi Xue, Yiqi Liu, Lifeng Nai, and Jian Huang. 2023. V10: Hardware-assisted npu multi-tenancy for improved resource utilization and fairness. In Proceedings of the 50th Annual International Symposium on Computer Architecture. 1–15

  50. [50]

    Yuqi Xue, Yiqi Liu, Lifeng Nai, and Jian Huang. 2024. Hardware-assisted virtualization of neural processing units for cloud platforms. In 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 1–16

  51. [51]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  52. [52]

    C. Yang, Y. Sui, J. Xiao, et al. 2025. TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model. arXiv:2503.18278 [cs.CV]

  53. [53]

    Juheon Yi and Youngki Lee. 2020. Heimdall: mobile GPU coordination platform for augmented reality applications. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking. 1–14

  54. [54]

    Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, and Xuanzhe Liu. 2025. Dynamic sparse attention on mobile socs. arXiv preprint arXiv:2508.16703 (2025)

  55. [55]

    Enda Yu, Zhaoning Zhang, Dezun Dong, Yongwei Wu, and Xiangke Liao. 2025. PreScope: Unleashing the Power of Prefetching for Resource-Constrained MoE Inference. arXiv preprint arXiv:2509.23638 (2025)

  56. [56]

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence?. In Proceedings of the 57th annual meeting of the association for computational linguistics. 4791–4800

  57. [57]

    H. Zhang, M. Lyu, C. He, Y. Ao, and Y. Lin. 2025. TrimTokenator: Towards Adaptive Visual Token Pruning for Large Multimodal Models. arXiv:2509.00320 [cs.CV]

  58. [58]

    Yuxin Zhou, Zheng Li, Jun Zhang, Jue Wang, Yiping Wang, Zhongle Xie, Ke Chen, and Lidan Shou. 2025. FloE: On-the-Fly MoE Inference on Memory-constrained GPU. In Proceedings of the International Conference on Machine Learning (ICML)

  59. [59]

    Pengfei Zuo, Huimin Lin, Junbo Deng, Nan Zou, Xingkun Yang, Yingyu Diao, Weifeng Gao, Ke Xu, Zhangyu Chen, Shirui Lu, et al. 2025. Serving large language models on huawei cloudmatrix384. arXiv preprint arXiv:2506.12708 (2025)