Pith · machine review for the scientific record

arxiv: 2604.18788 · v1 · submitted 2026-04-20 · 💻 cs.LG

Recognition: unknown

Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:27 UTC · model grok-4.3

classification 💻 cs.LG
keywords mixture of experts · LLM inference · Apple Silicon · neural processing unit · NPU offloading · inference optimization · long-context workloads · energy-efficient inference

The pith

NPUMoE lets Mixture-of-Experts LLMs offload most work to Apple NPUs using offline expert calibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces NPUMoE, a runtime engine that runs Mixture-of-Experts large language models on the Neural Processing Units inside Apple Silicon chips. MoE models activate only a few experts per token, which creates changing tensor shapes and irregular operations that NPUs cannot handle directly. NPUMoE addresses this by running offline calibration on sample data to predict which experts will be used most often, then grouping those experts into static tiers and keeping their computation graphs loaded on the NPU; dynamic routing and irregular steps fall back to the CPU or GPU. A reader would care because this makes long-context inference noticeably faster and more energy-efficient on existing Apple hardware without redesigning the chips.
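
The calibration-then-freeze idea is easiest to picture as frequency counting over routing decisions. Below is a minimal sketch in Python, assuming the router's top-k expert indices can be logged per layer during a calibration pass over sample prompts; the function and variable names are illustrative, not taken from the paper.

```python
from collections import Counter
from typing import Dict, List

def calibrate_expert_popularity(
    routing_traces: List[List[List[int]]],  # [sample][layer] -> expert ids selected for that sample
    num_layers: int,
) -> Dict[int, List[int]]:
    """Count how often each expert is selected per layer during calibration
    and return a per-layer popularity ranking (most- to least-selected)."""
    counts = [Counter() for _ in range(num_layers)]
    for sample in routing_traces:
        for layer_idx, expert_ids in enumerate(sample):
            counts[layer_idx].update(expert_ids)
    # Rank experts within each layer by selection frequency, descending.
    return {
        layer_idx: [expert for expert, _ in counts[layer_idx].most_common()]
        for layer_idx in range(num_layers)
    }
```

Per the paper's own description (see the Figure 10 excerpt below), this ranking then configures two runtime decisions: capacity-tier assignment and expert compute graph residency.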

Core claim

NPUMoE is a runtime inference engine that accelerates MoE execution on Apple Silicon by offloading dense, static computation to the NPU while preserving a CPU/GPU fallback path for dynamic operations. NPUMoE uses offline calibration to estimate expert capacity and popularity, and these estimates drive three key techniques: (1) static tiers for expert capacity to address dynamic expert routing; (2) grouped expert execution to mitigate NPU concurrency limits; and (3) load-aware expert compute graph residency to reduce CPU-NPU synchronization overhead. Experiments on Apple M-series devices using three representative MoE LLMs and four long-context workloads show that NPUMoE consistently outperforms baselines, reducing latency by 1.32x-5.55x, improving energy efficiency by 1.81x-7.37x, and reducing CPU-cycle usage by 1.78x-5.54x through effective NPU offloading.
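
One way to read "static tiers for expert capacity" is that each expert gets a fixed, pre-compiled token capacity drawn from a small menu of sizes, chosen from the calibration popularity ranking so the NPU graphs never see a dynamic shape. A hedged sketch follows; the tier sizes and the rank-bucketing rule are assumptions, not the authors' published policy.

```python
def assign_capacity_tiers(
    popularity_ranking: list[int],                 # expert ids, most popular first (from calibration)
    tier_sizes: tuple[int, ...] = (256, 64, 16),   # assumed menu of static token capacities
) -> dict[int, int]:
    """Map each expert to one of a few fixed capacities so its NPU compute
    graph can be compiled once with a static input shape. Popular experts
    get larger tiers; overflow beyond a tier would take the CPU/GPU
    fallback path."""
    n = len(popularity_ranking)
    tiers = {}
    for rank, expert_id in enumerate(popularity_ranking):
        # Split the ranking into equal buckets, one bucket per tier size.
        bucket = min(rank * len(tier_sizes) // max(n, 1), len(tier_sizes) - 1)
        tiers[expert_id] = tier_sizes[bucket]
    return tiers
```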

What carries the argument

Offline calibration that produces static expert capacity tiers and popularity estimates, which then drive grouped NPU execution and load-aware graph residency.
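
The same popularity signal plausibly drives the other two techniques: fuse several experts into one compute-graph launch to amortize dispatch, and keep only the hottest groups resident on the NPU while cold experts take the CPU/GPU fallback. A minimal sketch under assumed parameters (`group_size`, `npu_resident_groups`); the paper's actual grouping and residency policies are not specified at this level of detail in the review.

```python
def plan_groups_and_residency(
    popularity_ranking: list[int],   # expert ids, most popular first
    group_size: int = 4,             # assumed number of experts fused per NPU launch
    npu_resident_groups: int = 2,    # assumed number of groups kept loaded on the NPU
) -> tuple[list[list[int]], dict[int, str]]:
    """Group experts by popularity rank and mark the hottest groups as
    NPU-resident; all remaining experts route to the CPU/GPU fallback."""
    groups = [
        popularity_ranking[i:i + group_size]
        for i in range(0, len(popularity_ranking), group_size)
    ]
    placement = {}
    for group_idx, group in enumerate(groups):
        device = "npu" if group_idx < npu_resident_groups else "cpu_gpu_fallback"
        for expert_id in group:
            placement[expert_id] = device
    return groups, placement
```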

If this is right

  • Most of the expert matrix multiplications in MoE models can be moved to the NPU even though routing is dynamic.
  • Long-context prefills on Apple M-series devices become 1.32x to 5.55x faster.
  • Energy used per token drops by 1.81x to 7.37x when the NPU handles the grouped static work.
  • CPU cycles spent on inference fall by 1.78x to 5.54x, freeing the processor for other tasks.
  • The same calibration and grouping approach works across different MoE models and workloads tested.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same offline calibration pattern could be tried on other mobile NPUs if their shape and concurrency constraints are similar.
  • Device makers might add more flexible support for scatter/gather and top-k inside future NPUs once software shows the payoff.
  • On-device long-context applications such as document summarization could become practical on phones and laptops that already contain Apple Silicon.
  • If calibration data is refreshed periodically from recent user sessions, the method might stay effective without full retraining.

Load-bearing premise

That offline calibration on representative data can produce static expert capacity tiers and popularity estimates that remain accurate enough at runtime to avoid frequent fallback or performance collapse on unseen long-context workloads.

What would settle it

Running the system on a long-context workload whose expert routing statistics differ sharply from the calibration data and checking whether latency, energy, and CPU gains disappear or reverse.
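
A concrete version of that test: compare the expert-selection distribution of an out-of-distribution long-context workload against the calibration distribution, and track how often tokens land on experts whose graphs calibration did not keep on the NPU. The sketch below is one plausible instrumentation, not the paper's evaluation code; the divergence metric and the fallback definition are assumptions.

```python
import math

def routing_shift_and_fallback_rate(
    calib_counts: dict[int, int],     # expert id -> selection count during calibration
    runtime_counts: dict[int, int],   # expert id -> selection count on the new workload
    npu_resident: set[int],           # experts whose compute graphs are kept on the NPU
) -> tuple[float, float]:
    """Return (KL divergence of runtime routing vs. calibration routing,
    fraction of runtime selections that missed the NPU-resident set)."""
    experts = set(calib_counts) | set(runtime_counts)
    calib_total = sum(calib_counts.values()) or 1
    run_total = sum(runtime_counts.values()) or 1
    eps = 1e-9  # smoothing so unseen experts do not blow up the log
    kl = 0.0
    for e in experts:
        p = runtime_counts.get(e, 0) / run_total + eps
        q = calib_counts.get(e, 0) / calib_total + eps
        kl += p * math.log(p / q)
    fallback_rate = sum(
        count for e, count in runtime_counts.items() if e not in npu_resident
    ) / run_total
    return kl, fallback_rate
```

If the divergence rises sharply while the latency and energy gains hold and the fallback rate stays low, the load-bearing premise survives; if the gains track the divergence downward, it does not.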

Figures

Figures reproduced from arXiv: 2604.18788 by Afsara Benazir and Felix Xiaozhu Lin.

Figure 1
Figure 1: Our system: NPUMoE relative to CPUs/GPUs. Prior work primarily focuses on efficient execution of dense transformer models on mobile GPUs [9] and NPUs [11, 12, 43]. In contrast, MoE LLMs such as PhiMoE [1] and Qwen3 MoE [47] exhibit dynamic, sparse execution patterns that are fundamentally misaligned with NPU execution pipelines. In addition to executing a dense attention block, in MoE architectures, a ro… view at source ↗
Figure 2
Figure 2: Prefill phase dominates end-to-end inference. …NPUMoE therefore maintains a small resident working set of expert compute graphs on NPU, formed by hot expert groups whose launches can be amortized, while relegating cold experts to the CPU/GPU fallback path. Expert popularity (§4.3) determines the hot/cold state of each expert. Implementation and Evaluation (§5): We implement NPUMoE from scratch, usi… view at source ↗
Figure 4
Figure 4: Runtime operations supported by ANE and CPU. …expert FFN compute, gathers expert output, and attention, and each expert must handle variable input tokens. The Apple Neural Engine is Apple Silicon's dedicated NPU for energy-efficient, high-throughput neural network inference. Its unified-memory architecture allows the CPU, GPU, and ANE to access the same memory pool, making the platform attractive for on-device … view at source ↗
Figure 8
Figure 8: CPU-NPU synchronization overhead. The benefit appears only when enough work is amortized per launch and graph compute isn't too small (M=64) or too large (M=32048, or vocabulary size). CPU-NPU coordination pays the cost of high synchronization [14]. Unified memory helps communication, but it does not eliminate coordination cost; several recent systems show that synchronization and launch overheads can be comparab… view at source ↗
Figure 6
Figure 6: A compute graph of a single MoE router. Dense MatMul runtimes (CPU only vs. CPU+NPU):

  Shape (M, N, K)         CPU only (ms)   CPU+NPU (ms)   NPU speedup
  (64, 4096, 4096)        1.27            1.24           1.03x
  (256, 4096, 4096)       3.80            2.67           1.35x
  (512, 4096, 4096)       7.33            3.80           1.94x
  (1024, 4096, 4096)      15.20           7.22           2.12x
  (2048, 4096, 4096)      30.78           13.15          2.34x
  (32048, 4096, 14336)    3322.26         3201.00        1.04x

view at source ↗
Figure 7
Figure 7: Runtime of dense matmuls on M2 Ultra demonstrates NPU efficiency. Benefit is reduced for very small (M=64) or very large (M=32048) matrices. …higher than execution time, making runtime dynamic graph construction on NPU impractical. Each compute graph is unique; components with similar shape but different weights cannot trivially share a single graph [21]. Scenario: Two deployment scenarios exist: (1) Avail… view at source ↗
Figure 9
Figure 9: Our system NPUMoE incorporates key designs: (1) static tiers for expert capacity (§4.1), (2) grouped expert execution (§4.2), and (3) load-aware compute graph residency (§4.3). GPU resources remain available for background tasks while CPU and NPU synchronize between runtime operations. …and (2) optimize for resource efficiency. In particular, we seek to achieve target latency goals, improve energy efficiency… view at source ↗
Figure 10
Figure 10: Static tiers for expert capacity (§4.1). 3.3 Offline Calibration: We perform an offline calibration pass that measures per-layer expert routing statistics and derives an expert popularity ranking. NPUMoE then uses this ranking to configure two runtime decisions: capacity-tier assignment (§4.1) and expert compute graph residency (§4.3). For each layer, we record how often each expert is selected and rank e… view at source ↗
Figure 11
Figure 11: Grouped expert execution (§4.2). 4.2 Grouped Expert Execution: Following §4.1, a naive, straightforward approach is to allocate a separate compute graph for each expert after estimating its capacity, but this incurs non-trivial dispatch overhead. Issue: fine-grained per-expert execution launches many small graphs at runtime. Since these launches typically share a single device queue, they are serviced sequ… view at source ↗
Figure 12
Figure 12: Expert placement based on expert popularity on different compute units (§4.3). E0 means expert 0, and so on. …token position multiplied by their corresponding routing weights. This layout eliminates dynamic indexing inside the compiled graph: expert-to-token assignment is resolved before invocation, and the model only sees a dense, statically shaped tensor. For example, if four experts receive (64, 32, 32,… view at source ↗
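
Figure 12's layout rule, read literally, amounts to gathering each expert's tokens into a fixed-capacity, zero-padded buffer on the host before the compiled graph is invoked, so the NPU only ever sees statically shaped inputs. A sketch of that packing under assumed shapes; the function name, capacity handling, and overflow behavior are illustrative, not the authors' implementation.

```python
import numpy as np

def pack_tokens_statically(
    hidden_states: np.ndarray,        # (num_tokens, hidden_dim) activations for one MoE layer
    expert_assignment: list[int],     # per-token expert id, resolved on the CPU before invocation
    capacities: dict[int, int],       # expert id -> static capacity from the tier assignment
) -> dict[int, np.ndarray]:
    """Gather each expert's tokens into a fixed-capacity buffer, padding
    with zeros and leaving overflow tokens to the fallback path, so every
    NPU graph input has shape (capacity, hidden_dim)."""
    hidden_dim = hidden_states.shape[1]
    packed = {}
    for expert_id, capacity in capacities.items():
        buffer = np.zeros((capacity, hidden_dim), dtype=hidden_states.dtype)
        token_ids = [i for i, e in enumerate(expert_assignment) if e == expert_id][:capacity]
        if token_ids:
            buffer[: len(token_ids)] = hidden_states[token_ids]
        packed[expert_id] = buffer
    return packed
```
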
Figure 13
Figure 13: Trade-off analysis of latency, energy consumption, and CPU cycle usage for PhiMoE. NPUMoE achieves the optimal Pareto frontier compared to all baselines. …FP16 before computation [37]. Unless otherwise noted, all comparisons use the same model checkpoint, tokenizer, FP16 precision, and real routing traces from the workload. Offline Calibration: We profile and evaluate on separate randomized subsets of our test… view at source ↗
Figure 15
Figure 15: Energy consumption breakdown per token across varying prefill chunk sizes (C) and prompt lengths (P) for PhiMoE on the M2 Ultra. NPUMoE consistently demonstrates the highest energy efficiency across all configurations. Panels compare Ours, ANEMLL, CoreML (naïve), and CoreML (CPU) across settings including prefill chunk = 512, prompt len. = 1024 and prefill chunk = 256, prompt len. = 1024. … view at source ↗
Figure 16
Figure 16: Total CPU cycle consumption across prefill workloads for PhiMoE on M2 Ultra (lower is better). Reduced usage indicates effective offloading of compute to the NPU. Our gains persist across both short and long prompts and remain robust as chunk size increases. For short prompts (length=1024), the largest advantage appears at chunk size 512, where CoreML (naïve), CoreML (CPU), and ANEMLL consume 7.37x, 3.19x, a… view at source ↗
Figure 17
Figure 17: Impact of our key techniques. …(3) Ours-TG: Ours-T with grouped expert execution; and (4) Ours-all: Ours-TG with load-aware expert compute residency. Our three techniques make a significant contribution to the overall improvement. view at source ↗
Figure 19
Figure 19: Runtime latency and energy efficiency of PhiMoE-tiny running on M2 Max. view at source ↗
read the original abstract

Apple Neural Engine (ANE) is a dedicated neural processing unit (NPU) present in every Apple Silicon chip. Mixture-of-Experts (MoE) LLMs improve inference efficiency via sparse activation but are challenging for NPUs in three ways: expert routing is unpredictable and introduces dynamic tensor shapes that conflict with the shape-specific constraints of NPUs; several irregular operators, e.g., top-k, scatter/gather, etc., are not NPU-friendly; and launching many small expert kernels incurs substantial dispatch and synchronization overhead. NPUs are designed to offload AI compute from CPU and GPU; our goal is to enable such offloading for MoE inference, particularly during prefill, where long-context workloads consume substantial system resources. This paper presents NPUMoE, a runtime inference engine that accelerates MoE execution on Apple Silicon by offloading dense, static computation to NPU, while preserving a CPU/GPU fallback path for dynamic operations. NPUMoE uses offline calibration to estimate expert capacity and popularity that drives three key techniques: (1) Static tiers for expert capacity to address dynamic expert routing (2) Grouped expert execution to mitigate NPU concurrency limits (3) Load-aware expert compute graph residency to reduce CPU-NPU synchronization overhead. Experiments on Apple M-series devices using three representative MoE LLMs and four long-context workloads show that NPUMoE consistently outperforms baselines, reducing latency by 1.32x-5.55x, improving energy efficiency by 1.81x-7.37x, and reducing CPU-cycle usage by 1.78x-5.54x through effective NPU offloading.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces NPUMoE, a runtime inference engine for Mixture-of-Experts LLMs on Apple Silicon NPUs. It addresses NPU limitations with MoE (unpredictable routing, irregular operators, dispatch overhead) by using offline calibration to derive static expert capacity tiers and popularity estimates. These drive three techniques: static capacity tiers, grouped expert execution, and load-aware compute graph residency, enabling NPU offloading of dense static computation with CPU/GPU fallback for dynamic parts. Experiments on M-series devices with three MoE LLMs and four long-context workloads report latency reductions of 1.32x-5.55x, energy efficiency gains of 1.81x-7.37x, and CPU-cycle reductions of 1.78x-5.54x.

Significance. If the empirical claims hold under scrutiny, this work would be significant for practical deployment of sparse MoE models on consumer NPUs, particularly for long-context prefill workloads that stress system resources. It offers concrete engineering solutions to make NPU offloading viable for dynamic routing patterns, which could improve the accessibility and efficiency of large LLMs on Apple hardware. The paper supports its contributions with specific quantitative gains on representative models and workloads, though the absence of detailed validation limits immediate impact assessment.

major comments (2)
  1. [Offline calibration and techniques description] The offline calibration of static expert capacity tiers and popularity estimates (described as driving all three core techniques) is load-bearing for the claimed speedups, yet no ablation is provided on calibration-test distribution mismatch or measured fallback frequency to CPU/GPU paths when long-context inputs shift activation patterns from the calibration data. This directly affects whether the 1.32x-5.55x latency and 1.81x-7.37x energy gains generalize beyond the four evaluated workloads.
  2. [Experiments] The Experiments section (and abstract) reports specific performance numbers (latency, energy, CPU cycles) across three MoE LLMs and four workloads but supplies no baseline descriptions, workload definitions, error bars, hardware configurations, or statistical details, preventing verification of the central empirical claim that NPUMoE consistently outperforms via effective NPU offloading.
minor comments (1)
  1. [Introduction] The abstract and introduction could more clearly distinguish the NPU-specific constraints (shape-specific kernels, concurrency limits) from general MoE challenges to sharpen the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that additional analysis on calibration robustness and complete experimental details are needed to strengthen the paper. We will revise accordingly.

read point-by-point responses
  1. Referee: [Offline calibration and techniques description] The offline calibration of static expert capacity tiers and popularity estimates (described as driving all three core techniques) is load-bearing for the claimed speedups, yet no ablation is provided on calibration-test distribution mismatch or measured fallback frequency to CPU/GPU paths when long-context inputs shift activation patterns from the calibration data. This directly affects whether the 1.32x-5.55x latency and 1.81x-7.37x energy gains generalize beyond the four evaluated workloads.

    Authors: We agree that demonstrating robustness to distribution shift is important. In the revised manuscript we will add an ablation evaluating NPUMoE on additional long-context inputs drawn from distributions distinct from the calibration set (e.g., different prompt styles and context lengths). We will also report the observed fallback frequency to CPU/GPU paths on the four evaluated workloads, confirming that static tiers keep fallback low. revision: yes

  2. Referee: [Experiments] The Experiments section (and abstract) reports specific performance numbers (latency, energy, CPU cycles) across three MoE LLMs and four workloads but supplies no baseline descriptions, workload definitions, error bars, hardware configurations, or statistical details, preventing verification of the central empirical claim that NPUMoE consistently outperforms via effective NPU offloading.

    Authors: We acknowledge the omission of these details. The revised Experiments section will include: explicit descriptions of all baselines, precise definitions and input characteristics of the four workloads, hardware specifications for the M-series devices, error bars with standard deviations from repeated runs, and any relevant statistical information. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems paper with measured results

full rationale

The manuscript describes an engineering runtime system (NPUMoE) that uses offline calibration to set static expert tiers and popularity estimates, then applies three techniques (static capacity tiers, grouped execution, load-aware graph residency) and reports measured speedups on four workloads against baselines. No equations, first-principles derivations, or fitted predictions are presented as outputs that reduce to the calibration inputs by construction. No self-citation chains or uniqueness theorems are invoked to justify core claims. The contribution is therefore self-contained empirical evaluation rather than a closed derivation loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. Offline calibration is implied to produce fitted popularity and capacity values, but their exact form and fitting procedure are unavailable.

pith-pipeline@v0.9.0 · 5603 in / 1248 out tokens · 46727 ms · 2026-05-10T05:27:57.006915+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 22 canonical work pages · 10 internal anchors

  1. [1]

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. 2024. Phi-4 technical report. arXiv preprint arXiv:2412.08905 (2024)

  2. [2]

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. 2023. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369 (2023)

  3. [3]

    Osayamen Jonathan Aimuyo, Byungsoo Oh, and Rachee Singh. 2025. FlashMoE: Fast Distributed MoE in a Single Kernel. arXiv preprint arXiv:2506.04667 (2025)

  4. [4]

    Yash Akhauri, Ahmed F AbouElhamayed, Jordan Dotzel, Zhiru Zhang, Alexander M Rush, Safeen Huda, and Mohamed S Abdelfattah. 2024. Shadowllm: Predictor-based contextual sparsity for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 19154–19167

  5. [5]

    ANEMLL Contributors. 2026. ANEMLL: Artificial Neural Engine Machine Learning Library. https://github.com/Anemll/Anemll. GitHub repository, version 0.3.5 beta, accessed 2026-04-07

  6. [6]

    antmikinka. 2024. Optimization Guidelines for the Apple Neural Engine (ANE). https://gist.github.com/antmikinka/715499ae63630575065b22e5cb6ad8dd. GitHub Gist. Accessed: 2026-04-15

  7. [7]

    Apple Inc. 2023. Apple Reports First Quarter Results. Corporate Report. Apple Inc. https://www.apple.com/newsroom/2023/02/apple-reports-first-quarter-results/

  8. [8]

    Apple Machine Learning Research. 2024. Deploying Attention-Based Vision Transformers to Apple Neural Engine. Apple Machine Learning Research (5 Jan 2024). https://machinelearning.apple.com/research/vision-transformers

  9. [9]

    Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E Gonzalez, Matei Zaharia, and Ion Stoica. 2025. Moe-lightning: High-throughput moe inference on memory-constrained gpus. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 715–730

  10. [10]

    Le Chen, Dahu Feng, Erhu Feng, Yingrui Wang, Rong Zhao, Yubin Xia, Pinjie Xu, and Haibo Chen. 2025. Characterizing mobile soc for accelerating heterogeneous llm inference. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 359–374

  11. [11]

    Le Chen, Dahu Feng, Erhu Feng, Rong Zhao, Yingrui Wang, Yubin Xia, Haibo Chen, and Pinjie Xu. 2025. Heterollm: Accelerating large language model inference on mobile socs platform with heterogeneous ai accelerators. arXiv e-prints (2025), arXiv–2501

  12. [12]

    Zhiyang Chen, Daliang Xu, Haiyang Shen, Chiheng Lou, Mengwei Xu, Shangguang Wang, Xin Jin, and Yun Ma. 2025. Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution. arXiv preprint arXiv:2510.15312 (2025)

  13. [13]

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short pap...

  14. [14]

    Yuntao Dai, Jing Wu, Hang Gu, and Teng Wang. 2026. Accelerating OpenPangu Inference on NPU via Speculative Decoding. arXiv preprint arXiv:2603.03383 (2026)

  15. [15]

    Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xiaoxuan Liu, Yifan Qiao, et al

  16. [16]

    Prefillonly: An inference engine for prefill-only workloads in large language model applications. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 399–414

  17. [17]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39

  18. [18]

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2024. The L...

  19. [19]

    Zixu Hao, Jianyu Wei, Tuowei Wang, Minxing Huang, Huiqiang Jiang, Shiqi Jiang, Ting Cao, and Ju Ren. 2025. Scaling llm test-time compute with mobile npu on smartphones. arXiv preprint arXiv:2509.23324 (2025)

  20. [20]

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654 (2024)

  21. [21]

    Paul Hübner, Andong Hu, Ivy Peng, and Stefano Markidis. 2025. Apple vs. oranges: Evaluating the apple silicon m-series socs for hpc performance and efficiency. In 2025 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 45–54

  22. [22]

    Apple Inc. 2026. Core ML. https://developer.apple.com/machine-learning/core-ml/. Accessed: 2026-03-02

  23. [23]

    Apple Inc. 2026. Core ML Models. https://developer.apple.com/machine-learning/models/. Accessed: 2026-03-02

  24. [24]

    Apple Inc. 2026. Deploying Transformers on the Apple Neural Engine. https://machinelearning.apple.com/research/neural-engine-transformers. Accessed: 2026-03-02

  25. [25]

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. 2024. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. Advances in Neural Information Processing Systems 37 (2024), 52481–52515

  26. [26]

    Jaehoon Jung, Jinpyo Kim, and Jaejin Lee. 2023. Deepum: Tensor migration and prefetching in unified memory. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 207–221

  27. [27]

    Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, and Baris Kasikci

  28. [28]

    Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models. In The Thirteenth International Conference on Learning Representations (ICLR)

  29. [29]

    Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, and Yunxin Liu. 2024. SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)

  30. [30]

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020)

  31. [31]

    Minchong Li, Feng Zhou, and Xiaohui Song. 2025. Bild: Bi-directional logits difference loss for large language model distillation. In Proceedings of the 31st International Conference on Computational Linguistics. 1168–1182

  32. [32]

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434 (2024)

  33. [33]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)

  34. [34]

    Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. 2023. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning. PMLR, 22137–22176

  35. [35]

    Microsoft. 2024. Phi-3.5-MoE-instruct. https://huggingface.co/microsoft/Phi-3.5-MoE-instruct. Hugging Face model card. Accessed: 2026-04-09

  36. [36]

    Microsoft. 2025. Phi-tiny-MoE-instruct. https://huggingface.co/microsoft/Phi-tiny-MoE-instruct. Hugging Face model card. Accessed: 2026-04-09

  37. [37]

    Seungjae Moon, Junseo Cha, Hyunjun Park, and Joo-Young Kim

  38. [38]

    Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window. In Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA). 808–820

  39. [39]

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)

  40. [40]

    Manjeet Singh. 2026. Inside the M4 Apple Neural Engine. https://maderix.substack.com/p/inside-the-m4-apple-neural-engine-615. maderix’s Substack

  41. [41]

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al

  42. [42]

    Kimi-VL technical report. arXiv preprint arXiv:2504.07491 (2025)

  43. [43]

    The ML-Energy Initiative. 2025. zeus-apple-silicon: Zeus device support for Apple Silicon. https://github.com/ml-energy/zeus-apple-silicon. GitHub repository

  44. [44]

    Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Haoning Guan, Rui Qu, Maoliang Li, Xiang Chen, and Guojie Luo. 2025. Agent.xpu: Efficient scheduling of agentic llm workloads on heterogeneous soc. arXiv preprint arXiv:2506.24045 (2025)

  45. [45]

    Wikipedia contributors. 2026. Apple Neural Engine. https://en.wikipedia.org/wiki/Neural_Engine. Accessed: 2026-03-02

  46. [46]

    Ao Xiao, Bangzheng He, Baoquan Zhang, Baoxing Huai, Bingji Wang, Bo Wang, Bo Xu, Boyi Hou, Chan Yang, Changhong Liu, et al. 2025. xdeepserve: Model-as-a-service on huawei cloudmatrix384. arXiv preprint arXiv:2508.02520 (2025)

  47. [47]

    Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, and Xuanzhe Liu. 2025. Fast on-device llm inference with npus. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 445–462

  48. [48]

    Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. 2024. Moe-infinity: Efficient moe inference on personal machines with sparsity-aware expert cache. arXiv preprint arXiv:2401.14361 (2024)

  49. [49]

    Yuqi Xue, Yiqi Liu, Lifeng Nai, and Jian Huang. 2023. V10: Hardware-assisted npu multi-tenancy for improved resource utilization and fairness. In Proceedings of the 50th Annual International Symposium on Computer Architecture. 1–15

  50. [50]

    Yuqi Xue, Yiqi Liu, Lifeng Nai, and Jian Huang. 2024. Hardware-assisted virtualization of neural processing units for cloud platforms. In 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 1–16

  51. [51]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  52. [52]

    C. Yang, Y. Sui, J. Xiao, et al. 2025. TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model. arXiv:2503.18278 [cs.CV]

  53. [53]

    Juheon Yi and Youngki Lee. 2020. Heimdall: mobile GPU coordination platform for augmented reality applications. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking. 1–14

  54. [54]

    Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, and Xuanzhe Liu. 2025. Dynamic sparse attention on mobile socs. arXiv preprint arXiv:2508.16703 (2025)

  55. [55]

    Enda Yu, Zhaoning Zhang, Dezun Dong, Yongwei Wu, and Xiangke Liao. 2025. PreScope: Unleashing the Power of Prefetching for Resource-Constrained MoE Inference. arXiv preprint arXiv:2509.23638 (2025)

  56. [56]

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence?. In Proceedings of the 57th annual meeting of the association for computational linguistics. 4791–4800

  57. [57]

    H. Zhang, M. Lyu, C. He, Y. Ao, and Y. Lin. 2025. TrimTokenator: Towards Adaptive Visual Token Pruning for Large Multimodal Models. arXiv:2509.00320 [cs.CV]

  58. [58]

    Yuxin Zhou, Zheng Li, Jun Zhang, Jue Wang, Yiping Wang, Zhongle Xie, Ke Chen, and Lidan Shou. 2025. FloE: On-the-Fly MoE Inference on Memory-constrained GPU. In Proceedings of the International Conference on Machine Learning (ICML)

  59. [59]

    Pengfei Zuo, Huimin Lin, Junbo Deng, Nan Zou, Xingkun Yang, Yingyu Diao, Weifeng Gao, Ke Xu, Zhangyu Chen, Shirui Lu, et al. 2025. Serving large language models on huawei cloudmatrix384. arXiv preprint arXiv:2506.12708 (2025)