arxiv: 2508.18983 · v3 · submitted 2025-08-26 · 💻 cs.AI

SMoE: An Algorithm-System Co-Design for Pushing MoE to the Edge via Expert Substitution

Pith reviewed 2026-05-18 21:36 UTC · model grok-4.3

classification 💻 cs.AI
keywords Mixture of Experts · Edge Inference · Expert Substitution · Cache Reuse · Decoding Latency · GPU Offloading · Model Compression · Scheduling Policy

The pith

SMoE substitutes low-importance experts with cached similar ones to cut MoE decoding latency 48 percent on edge hardware while keeping accuracy nearly lossless.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to run Mixture of Experts models on memory-limited edge devices by replacing some activated experts with similar ones already stored in the GPU cache. This guided substitution reduces the need to fetch data over the slow PCIe connection and pairs with a scheduler that reuses cached experts as often as possible. The result is lower memory pressure and faster inference without much accuracy cost. A sympathetic reader would care because the approach could let larger, more capable sparse models run directly on phones, laptops, and other consumer hardware instead of requiring data-center GPUs.

Core claim

The paper claims that an algorithm-system co-design called SMoE uses expert importance to substitute low-importance activated experts with functionally similar ones already cached in GPU memory, together with a scheduling policy that maximizes the reuse ratio of those cached experts. This combination reduces memory usage and data transfer, largely eliminates PCIe overhead, and delivers 48 percent lower decoding latency with over 60 percent expert cache hit rate while maintaining nearly lossless accuracy.
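
The substitution step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name, the fixed `importance_threshold`, and the precomputed `similarity` table are all assumptions introduced here, since the page does not say how importance or similarity is computed.

```python
from typing import Dict, List, Set, Tuple


def substitute_experts(
    scores: Dict[int, float],
    top_k: int,
    cached: Set[int],
    similarity: Dict[Tuple[int, int], float],
    importance_threshold: float,
) -> List[int]:
    """Return the expert IDs to execute for one token at one MoE layer.

    Hypothetical sketch: names and the threshold rule are assumptions.
    """
    # Router step: pick the top-k experts by gate score.
    activated = sorted(scores, key=scores.get, reverse=True)[:top_k]

    chosen: List[int] = []
    for expert in activated:
        if expert in cached or scores[expert] >= importance_threshold:
            # Cached experts run as-is; important uncached experts are
            # still loaded from host memory (one PCIe transfer).
            chosen.append(expert)
            continue
        # Low-importance and uncached: look for the most similar cached
        # expert that is neither selected already nor activated itself.
        candidates = [c for c in cached if c not in chosen and c not in activated]
        if not candidates:
            chosen.append(expert)  # nothing to substitute with, so load it
            continue
        best = max(candidates, key=lambda c: similarity.get((expert, c), 0.0))
        chosen.append(best)
    return chosen


# Illustrative values only: expert 2 is low-importance and uncached,
# so it is swapped for cached expert 3, its closest match.
scores = {0: 0.5, 1: 0.3, 2: 0.1, 3: 0.05, 4: 0.05}
similarity = {(2, 3): 0.9, (2, 4): 0.2}
print(substitute_experts(scores, 3, {0, 3, 4}, similarity, 0.2))
```

In this sketch only expert 1 still needs a PCIe load; a lower threshold trades fewer substitutions for more transfers.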

What carries the argument

Importance-guided expert substitution paired with a reuse-maximizing scheduler for GPU-cached experts.

Load-bearing premise

Low-importance activated experts can be replaced by functionally similar cached experts without significant accuracy degradation.

What would settle it

A direct accuracy comparison on standard language-model benchmarks showing whether perplexity or task scores degrade noticeably when the substitution policy is active versus when every required expert is loaded from host memory.
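
One way such a comparison could be scored is sketched below, assuming per-token natural-log probabilities are available from two runs over the same text (one with substitution active, one loading every required expert). The helper names are hypothetical and no values here come from the paper.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log probabilities of the reference tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase(baseline: float, substituted: float) -> float:
    """Fractional perplexity change caused by substitution (0.0 = none)."""
    return (substituted - baseline) / baseline


# Made-up log-probabilities, used only to show the call pattern.
full_load = [math.log(0.5)] * 4
with_substitution = [math.log(0.4)] * 4
print(relative_increase(perplexity(full_load), perplexity(with_substitution)))
```

A small relative increase across several benchmarks would support the "nearly lossless" claim; what counts as small is a judgment the evaluation would need to state.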

Figures

Figures reproduced from arXiv: 2508.18983 by Guoying Zhu, Haipeng Dai, Jun Xiao, Keran Li, Ligeng Chen, Meng Li, Weijun Wang, Wei Wang, Xuechen Liu.

Figure 1
Figure 1: Traditional MoE Layer vs. Our Importance-Driven Expert Layer (via substituting low-score experts and prefetching top-score experts). … memory imposes a significantly greater latency penalty than one already in GPU memory. This crucial difference is not adequately considered during MoE’s expert selection process, leading to suboptimal inference performance. Our approach: prioritizing and substituting active…
Figure 3
Figure 3: Online Expert Offloading in MoE LLMs at one layer. Step ①: Router selects the active experts. Step ②: CPU computes part of the active experts in CPU memory. Step ③: Part of active experts and CPU-computed expert results are transferred to GPU memory via PCIe. Step ④: GPU processes experts from its memory, consolidating those results with CPU-computed results. … traditional MoE, where common knowledg…
Figure 4
Figure 4: Only a few achieve high scores, significantly influencing the output, while others have low scores, similar to inactive experts.
Figure 5
Figure 5: Our idea: prefetching top-score experts and replacing low-score experts in each iteration at one layer. … the CPU and then aggregating the results with those from the GPU. These two approaches can be pipelined; as…
Figure 6
Figure 6: Time cost of CPU and GPU computing an expert with a token, and of PCIe loading an expert, for three MoE LLMs.
Figure 8
Figure 8: GPU computing vs CPU computing. … non-shared experts are initially kept in CPU memory, allowing for direct computation by the CPU or transfer to GPU memory. Online inference. Online inference is segmented into prefill and decoding phases. In the prefill phase, the system employs a traditional offloading-based LLM approach, transferring experts not initially in GPU memory but needed for computation via PCIe…
Figure 9
Figure 9: Importance-driven expert scheduler pipelines GPU, CPU, and load operations between two MoE layers. … MoE transformer layers, 𝑋 and 𝑌, to minimize pipeline bubbles. Three processes are pipelined, utilizing different resources: GPU, CPU, and PCIe. CPU: CPU computation is divided into four parts, encompassing three processes of the Importance-Driven Expert Scheduler and the computation of expert parameters o…
Figure 11
Figure 11: The reuse probability of experts based on score (in descending order) in three MoE models. … constantly available in the GPU memory. The second reason rests on our cache eviction strategy in Section 5.3 that ensures high-scoring, hence important, experts are retained in the cache during the most recent decoding stages. These factors collectively lead to score results that closely align with the true outcom…
Figure 10
Figure 10: Our prefetching method compared to traditional method and normal workflow to output true scores. … unshared experts currently in the GPU memory and shared experts, resulting in the production of hidden states. These hidden states are then processed using the next layer’s key-value cache to complete the attention computation. Subsequently, we carry out a gate computation to determine the scores for all expe…
Figure 12
Figure 12: TPOT of four baselines and our method in five workloads.
Figure 13
Figure 13: GPU cache ratio of three baselines and our method in five workloads on average. Models. To demonstrate that our method can be applied to various MoE models with the DeepSeekMoE architecture, we evaluate three popular MoE models with the DeepSeekMoE architecture: deepseek-moe-16b-base [6], Qwen2-57B-A14B-Instruct [7], and XVERSE-MoE-A4.2B-Chat [8]. Although not evaluated, our work also supports other model…
Figure 14
Figure 14: Prefilling time of baselines and SMoE on average. [Residue from an adjacent plot: load ratio under Settings S1, S2, and S3, split into top-ratio and low-ratio, for the series base, +CE, and +Pre.]
Figure 16
Figure 16 analyzes the role of each component in our method in enhancing overall performance via a stepwise incremental approach (batch size = 1), with each step reducing TPOT. The baseline method uses CPU-based expert offloading, equivalent to llama.cpp offloading all experts to CPU memory while keeping some experts and common parameters in GPU memory, but lacks an expert-cache router, prefetching, and caching s…
Figure 17
Figure 17: PCIe time vs TPOT. [Residue from an adjacent plot: accuracy per model (deepseekmoe, xversemoe, qwen2) for the series Ours (If top), Only res (If top), Ours (If activated), and Only res (If activated).]
Original abstract

The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache hit rate, while maintaining nearly lossless accuracy.
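
The expert cache hit rate quoted in the abstract can be measured over a trace of executed experts. The sketch below assumes a plain LRU cache as a stand-in; the paper's own reuse-maximizing scheduler and eviction strategy are not specified on this page, so `cache_hit_rate` and its trace format are hypothetical.

```python
from collections import OrderedDict
from typing import Iterable, List


def cache_hit_rate(trace: Iterable[List[int]], capacity: int) -> float:
    """Fraction of expert requests served from a GPU-side cache.

    `trace` holds, per decoding step, the expert IDs executed at one layer.
    LRU eviction is an assumption used here only for illustration.
    """
    cache: "OrderedDict[int, None]" = OrderedDict()
    hits = 0
    total = 0
    for step in trace:
        for expert in step:
            total += 1
            if expert in cache:
                hits += 1
                cache.move_to_end(expert)  # mark as most recently used
            else:
                cache[expert] = None  # models a load over PCIe
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


# Toy trace of three decoding steps with a two-expert cache.
print(cache_hit_rate([[0, 1], [0, 2], [0, 1]], capacity=2))
```

Running the same trace through a substitution-aware policy and comparing the two rates would show how much of the reported hit rate comes from substitution rather than from caching alone.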

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SMoE, an algorithm-system co-design for deploying Mixture-of-Experts (MoE) models on edge devices with limited memory. It substitutes low-importance activated experts with functionally similar experts already cached in GPU memory, guided by expert importance, and introduces a scheduling policy to maximize GPU cache reuse. This is claimed to reduce memory usage and PCIe overhead, delivering 48% lower decoding latency, over 60% expert cache hit rate, and nearly lossless accuracy.

Significance. If the substitution mechanism can be shown to preserve output distributions, the work would address a practical bottleneck in edge inference for large MoE models by reducing data movement while retaining model quality. The co-design of importance-based substitution with cache-aware scheduling is a relevant direction for systems research on sparse models.

major comments (2)
  1. [Abstract] The central performance claims (48% latency reduction, >60% cache hit rate, nearly lossless accuracy) are presented without any experimental details, baselines, datasets, error bars, or verification that substitutions preserve output distributions. The manuscript must supply these to support the claim that low-importance experts can be replaced without measurable accuracy loss.
  2. The notion of 'functionally similar' experts is invoked to justify substitution but is not defined. The paper must specify the similarity metric (e.g., cosine similarity on weights, activation statistics, or per-token routing scores) and provide evidence—such as measured logit differences or task accuracy before/after substitution—that the replacement leaves the computation graph and final outputs unchanged within a small epsilon for the evaluated queries.
minor comments (2)
  1. Clarify the MoE model sizes, number of experts, and routing top-k values used in the evaluations.
  2. Add a description of the hardware platform, PCIe bandwidth, and exact latency measurement methodology.
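
For the measurement methodology requested above, the two quantities involved can be written down explicitly. The sketch assumes one common definition of time per output token (TPOT), namely decode time divided by the number of tokens after the first, and a simple size-over-bandwidth estimate for an expert transfer. The paper may measure either quantity differently, and every number below is a placeholder.

```python
def time_per_output_token(
    e2e_latency_s: float, time_to_first_token_s: float, output_tokens: int
) -> float:
    """TPOT under the assumed definition: decode time per token after the first."""
    if output_tokens < 2:
        raise ValueError("need at least two output tokens")
    return (e2e_latency_s - time_to_first_token_s) / (output_tokens - 1)


def pcie_transfer_seconds(expert_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower-bound transfer time for one expert, ignoring setup overhead."""
    return expert_bytes / bandwidth_bytes_per_s


# Placeholder values, not measurements from the paper.
print(time_per_output_token(10.0, 1.0, 10))
print(pcie_transfer_seconds(32e6, 16e9))
```

Reporting both alongside the effective (not nominal) PCIe bandwidth would let readers check how much of each decoding step is spent on transfers.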

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation of results and clarify technical details.

Point-by-point responses
  1. Referee: [Abstract] The central performance claims (48% latency reduction, >60% cache hit rate, nearly lossless accuracy) are presented without any experimental details, baselines, datasets, error bars, or verification that substitutions preserve output distributions. The manuscript must supply these to support the claim that low-importance experts can be replaced without measurable accuracy loss.

    Authors: We agree that the abstract, being concise by nature, omits the full experimental context. The body of the manuscript contains an Evaluation section that reports the full setup, including model configurations, datasets (standard language modeling benchmarks), baselines (e.g., naive expert offloading and cache-only policies), results with error bars across runs, and measurements confirming that substitutions preserve output distributions within small bounds. To better support the abstract claims, we will revise it to include a brief reference to the evaluation methodology and the verification of near-lossless accuracy. revision: yes

  2. Referee: The notion of 'functionally similar' experts is invoked to justify substitution but is not defined. The paper must specify the similarity metric (e.g., cosine similarity on weights, activation statistics, or per-token routing scores) and provide evidence—such as measured logit differences or task accuracy before/after substitution—that the replacement leaves the computation graph and final outputs unchanged within a small epsilon for the evaluated queries.

    Authors: This comment correctly identifies a point that requires clarification. The manuscript uses the term 'functionally similar' but does not explicitly define the metric or present supporting measurements. We will revise the text to define similarity via cosine similarity on expert weight parameters and add empirical evidence, including logit difference statistics and before/after task accuracy results, demonstrating that outputs remain unchanged within a small epsilon for the evaluated queries. revision: yes
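
The metric named in this simulated response, cosine similarity on expert weight parameters, could look like the following sketch. It assumes each expert's weights are flattened into one vector of equal length; the helper names are hypothetical and nothing here is taken from the paper's code.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened expert weight vectors."""
    if len(a) != len(b):
        raise ValueError("weight vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def most_similar(target: Sequence[float], cached: Dict[int, Sequence[float]]) -> int:
    """ID of the cached expert whose weights are closest to `target`."""
    return max(cached, key=lambda k: cosine_similarity(target, cached[k]))


# Toy two-dimensional "weights", for illustration only.
print(most_similar([1.0, 0.0], {3: [0.9, 0.1], 4: [0.0, 1.0]}))
```

Because expert weights are fixed at inference time, a table of pairwise similarities could be computed once offline. Weight similarity does not by itself guarantee similar outputs, which is why the referee also asks for logit and accuracy evidence.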

Circularity Check

0 steps flagged

No circularity: empirical measurements of substitution policy

Full rationale

The paper presents an algorithm-system co-design whose headline results (48% latency reduction, >60% cache hit rate, near-lossless accuracy) are reported as direct experimental outcomes on edge hardware. No equations, fitted parameters, or derivation chain are described that would reduce the substitution benefit to the input assumptions by construction. The core assumption—that low-importance experts can be replaced by functionally similar cached ones—is tested via measurement rather than defined into the result. Any self-citations are incidental and not load-bearing for the empirical claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities. The method implicitly assumes measurable expert importance and functional similarity between experts, but these are not formalized or evidenced here.

pith-pipeline@v0.9.0 · 5700 in / 983 out tokens · 36001 ms · 2026-05-18T21:36:59.262083+00:00 · methodology

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 4 internal anchors

  1. [1]

    2024. The documentation of Python 3.13: What’s New In Python 3.13? https://docs.python.org/3/whatsnew/3.13.html

  2. [2]

    2024. KTransformers: A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations. https://github.com/kvcache-ai/KTransformers

  3. [3]

    2024. Llama.cpp: a C++ implementation enabling efficient LLM inference on CPUs. https://github.com/ggml-org/llama.cpp

  4. [4]

    2024. OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc) over 100+ datasets. https://github.com/open-compass/opencompass

  5. [5]

    2024. This url describes some of the common LLM inference metrics. https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html

  6. [6]

    2024. This url introduces the LLM deepseek-moe-16b-chat. https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat

  7. [7]

    2024. This url introduces the LLM Qwen2-57B-A14B-Instruct. https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct

  8. [8]

    2024. This url introduces the LLM XVERSE-MoE-A4.2B-Chat. https://huggingface.co/xverse/XVERSE-MoE-A4.2B-Chat

  9. [9]

    Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. 2022. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis...

  10. [10]

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about Physical Commonsense in Natural Language. In Thirty-Fourth AAAI Conference on Artificial Intelligence

  11. [11]

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044 (2019)

  12. [12]

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457v1 (2018)

  13. [13]

    Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. 2024. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066 (2024)

  14. [14]

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy

  15. [15]

    RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 785–794. doi: 10.18653/v1/D17-1082

  16. [16]

    Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. arXiv preprint arXiv:1808.09121 (2018)

  17. [17]

    Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2024. Powerinfer: Fast large language model serving with a consumer-grade gpu. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. 590–606

  18. [18]

    Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. 2024. Moe-infinity: Activation-aware expert offloading for efficient moe serving. arXiv e-prints (2024), arXiv–2401

  19. [19]

    Shuzhang Zhong, Yanfan Sun, Ling Liang, Runsheng Wang, Ru Huang, and Meng Li. 2025. HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference. arXiv preprint arXiv:2504.05897 (2025).