arxiv: 2509.23638 · v2 · submitted 2025-09-28 · 💻 cs.LG

LayerScope: Predictive Cross-Layer Scheduling for Efficient Multi-Batch MoE Inference on Legacy Servers

Pith reviewed 2026-05-18 12:34 UTC · model grok-4.3

classification 💻 cs.LG
keywords Mixture-of-Experts · inference optimization · expert scheduling · PCIe offloading · predictive prefetching · cross-layer scheduling · asynchronous I/O

The pith

PreScope uses a learnable layer-aware predictor and cross-layer scheduling to deliver 141% higher throughput for MoE models on commodity servers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PreScope as a system to run Mixture-of-Experts models efficiently on standard servers where GPU memory is limited and expert weights must be offloaded to CPU. This offloading creates PCIe transfer delays that far exceed computation time, so the work focuses on predicting which experts will activate in each layer to enable timely prefetching. It combines the Learnable Layer-Aware Predictor to model layer-specific patterns, a global scheduler that balances prefetch costs against loading overhead, and an asynchronous I/O mechanism that hides data movement behind ongoing computation. If these pieces work together, multi-batch inference avoids most idle time and achieves large gains over prior methods.
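To make the idle-time argument concrete, here is a minimal timeline sketch comparing on-demand loading with prefetching. It is not the paper's cost model: it assumes perfect prediction, a single serial PCIe link, enough GPU memory to hold upcoming experts, and made-up per-layer times. The function names are illustrative.

```python
def on_demand_latency(transfer, compute):
    """Each layer first waits for its expert transfer, then computes."""
    return sum(t + c for t, c in zip(transfer, compute))


def prefetch_latency(transfer, compute):
    """Transfers run back-to-back on one serial link; a layer's compute
    starts once its weights have arrived and the previous layer is done."""
    link_free = 0.0      # time at which the link finishes the current transfer
    compute_done = 0.0   # time at which the previous layer finishes computing
    for t, c in zip(transfer, compute):
        link_free += t
        start = max(link_free, compute_done)
        compute_done = start + c
    return compute_done


# Hypothetical per-layer times (arbitrary units): transfer dominates compute.
transfer = [4, 4, 4]
compute = [1, 1, 1]
print(on_demand_latency(transfer, compute))  # 15
print(prefetch_latency(transfer, compute))   # 13.0
```

When transfer dominates, as in this toy input, prefetching only hides the compute time behind the link, so most of the remaining gain has to come from moving fewer experts or running some of them on the CPU.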

Core claim

PreScope is a prediction-driven expert scheduling system that addresses inaccurate activation prediction, PCIe bandwidth competition, and cross-device scheduling complexity through three components: the Learnable Layer-Aware Predictor (LLaPor) that captures layer-specific expert activation patterns, Prefetch-Aware Cross-Layer Scheduling (PreSched) that generates globally optimal plans, and Asynchronous I/O Optimizer (AsyncIO) that decouples I/O from computation, yielding 141% higher throughput and 74.6% lower latency than state-of-the-art solutions.
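As a reading aid, the two headline percentages can be converted to ratios against the baseline. The snippet below is plain arithmetic on the figures quoted above; the function names are illustrative and nothing here is taken from the paper's code.

```python
def ratio_from_percent_higher(percent_higher: float) -> float:
    """'X% higher' means new = baseline * (1 + X / 100)."""
    return 1.0 + percent_higher / 100.0


def ratio_from_percent_lower(percent_lower: float) -> float:
    """'X% lower' means new = baseline * (1 - X / 100)."""
    return 1.0 - percent_lower / 100.0


throughput_ratio = ratio_from_percent_higher(141.0)
latency_ratio = ratio_from_percent_lower(74.6)
print(f"throughput: {throughput_ratio:.2f}x baseline")  # throughput: 2.41x baseline
print(f"latency: {latency_ratio:.3f}x baseline")        # latency: 0.254x baseline
```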

What carries the argument

The Learnable Layer-Aware Predictor (LLaPor) that captures layer-specific expert activation patterns to drive prefetch and scheduling decisions.

If this is right

  • Large MoE models become practical to serve on servers that lack high-capacity GPU memory by moving most weights to CPU and moving only needed experts over PCIe.
  • Globally optimal prefetch plans across layers reduce the total data movement cost compared with per-layer greedy decisions.
  • Decoupling I/O from computation through asynchronous operations removes waiting bubbles and raises overall GPU utilization during inference.
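The last point, decoupling I/O from computation, can be sketched as a double-buffered loop in which the next layer's experts are requested before the current layer is computed. This only illustrates the control flow and is not AsyncIO itself: `load_experts` and `compute` are hypothetical stand-ins for a host-to-device copy and a GPU kernel, and a single worker thread stands in for one serial transfer channel.

```python
from concurrent.futures import ThreadPoolExecutor


def load_experts(layer):
    """Stand-in for copying one layer's predicted experts to the GPU."""
    return f"weights[{layer}]"


def compute(layer, weights):
    """Stand-in for running one layer with the experts already loaded."""
    return f"out[{layer}] using {weights}"


def run_pipeline(num_layers):
    outputs = []
    # One worker models a single transfer channel that serves requests in order.
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_experts, 0)
        for layer in range(num_layers):
            weights = pending.result()  # block only if the load is not done yet
            if layer + 1 < num_layers:
                # Request the next layer's experts before computing this one.
                pending = io.submit(load_experts, layer + 1)
            outputs.append(compute(layer, weights))
    return outputs


print(run_pipeline(3))
```

In a real system the load would be an asynchronous device copy on its own stream; the thread pool here is only a convenient way to show where the wait point sits.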

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prediction-plus-prefetch pattern could be tested on other memory-bound transformer variants that use conditional computation.
  • On hardware with faster interconnects the relative benefit of accurate prediction might shrink, which could be checked by repeating the measurements on newer servers.
  • Activation patterns may shift after continued fine-tuning, so periodic retraining of the predictor would likely be needed to maintain the reported gains.

Load-bearing premise

The Learnable Layer-Aware Predictor can capture layer-specific expert activation patterns with enough accuracy that the resulting prefetch and scheduling decisions produce net gains rather than added overhead.

What would settle it

Run the same multi-batch MoE workload with the LLaPor predictor replaced by random or static expert selection and measure whether the reported throughput and latency gains disappear.
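A toy version of that comparison is sketched below: it scores a static, frequency-based selection and a random selection by prefetch hit rate on a small activation trace. The trace, the helper names, and the hit-rate metric are assumptions made for illustration; a real test would plug in LLaPor's predictions and measure end-to-end throughput and latency.

```python
import random
from collections import Counter


def static_top_k(history, k):
    """Pick the k experts activated most often in past steps."""
    counts = Counter(e for step in history for e in step)
    return {e for e, _ in counts.most_common(k)}


def random_k(num_experts, k, rng):
    """Pick k experts uniformly at random."""
    return set(rng.sample(range(num_experts), k))


def hit_rate(prefetched, trace):
    """Fraction of activated experts that were already prefetched."""
    hits = sum(len(prefetched & set(step)) for step in trace)
    total = sum(len(step) for step in trace)
    return hits / total if total else 0.0


# Hypothetical activation traces: each inner list is one step's active experts.
history = [[0, 1], [0, 2], [0, 1], [1, 3]]
future = [[0, 1], [1, 2], [0, 3]]

static_choice = static_top_k(history, k=2)               # {0, 1}
random_choice = random_k(4, k=2, rng=random.Random(0))
print(hit_rate(static_choice, future))                   # 4 hits out of 6
print(hit_rate(random_choice, future))
```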

Figures

Figures reproduced from arXiv: 2509.23638 by Dezun Dong, Dongsheng Li, Enda Yu, Haojie Wang, Weiling Yang, Xiangke Liao, Yongwei Wu, Zhaoning Zhang, Zhe Bai.

Figure 1
Figure 1: Impact of prefetching on inference latency. 1 Introduction. Mixture of Experts (MoE) [20, 27, 29, 43] models enhance computational efficiency in large language models via sparse activation, yet face a critical memory bottleneck in resource-constrained environments [21, 23, 34]. For example, loading all experts of Mixtral-8x7B [20] requires 80GB memory, far exceeding the 32GB capacity of NVIDIA V100 GPU. Exp…
Figure 3
Figure 3: Architecture and inference process of MoE models. 2. Cross-layer prefetch efficiency
Figure 5
Figure 5: Characterization of expert computation and transfer costs across heterogeneous devices. On-demand loading methods, such as Lina [26] and ExpertFlow [16], exclusively utilize GPUs for expert computation. If an expert is not prefetched to the GPU, it must be loaded on demand, as shown in …
Figure 6
Figure 6: Comparison of three existing expert activation prediction strategies. … concurrently costs no less than transferring them serially, so expert traffic can be modelled as a sequential process. This observation forms the basis of our inference-cost model. 2.4 MoE Prefetching. In MoE inference, prefetching hot experts for subsequent layers increases I/O–computation overlap [10, 52, 57]. Prior methods exploit dist…
Figure 9
Figure 9: Architectural comparison between the standard gating network and LLaPor. 4.2 Learnable Layer-aware Predictor. 4.2.1 Network Architecture of LLaPor. Building on the analysis in Section 2.1, we collect the hidden state $a$, the indices of the activated experts $expert_i$, and their corresponding weights $w_i$ per layer as feature variables during the offline phase. At training time, we use the previous-layer featur…
Figure 10
Figure 10: Comparison of PreSched scheduling versus layer-by-layer scheduling strategies. … total number of experts, and BCE refers to the binary cross-entropy loss $\mathrm{BCE}(y_i, p_i) = y_i \log(p_i) + (1 - y_i) \log(1 - p_i)$. The focal loss term emphasizes hard misclassified examples by reducing the contribution of easy samples through a modulating factor $(1 - p_t)^{\gamma}$, where $p_t$ is the predicted probability for the true label and …
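The two loss terms named in this excerpt can be written out per expert as follows. This is a sketch under assumptions: it uses the conventional sign, with a leading minus so that each loss is non-negative, and it keeps the two terms separate because the excerpt is cut off before showing how the paper weights or sums them. The `eps` clamp and the default `gamma` are illustrative choices.

```python
import math


def bce(y, p, eps=1e-12):
    """Binary cross-entropy for one expert: y is 0 or 1, p is the
    predicted activation probability."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal(y, p, gamma=2.0, eps=1e-12):
    """Focal term: cross-entropy scaled by the modulating factor
    (1 - p_t) ** gamma, where p_t is the probability given to the true label."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# An easy, well-classified example is down-weighted far more than a hard one.
print(focal(1, 0.9), bce(1, 0.9))
print(focal(1, 0.3), bce(1, 0.3))
```

With `gamma=0` the modulating factor is 1 and the focal term reduces to plain cross-entropy.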
Figure 11
Figure 11: Mathematical modeling of PreSched balancing latency benefit. Here, $\alpha$ is the delay in starting the on-demand operation due to previous prefetching, and $t_c(E_{all}[0 \ldots i])$ is the total CPU computational cost of experts $E_{all}[0 \ldots i]$, which can be calculated using Equation 3. If $T^{G}_{all} < T^{C}_{all}$, add $E_{all}[i \ldots n+n']$ to the GPU Queue. Since the current and predicted layers may have misaligned lo…
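The comparison visible in this excerpt can be illustrated for a single candidate split index. This is not PreSched's algorithm: the excerpt does not show how the GPU-side total is composed, so the sketch assumes it is the delay `alpha` plus serial transfer and GPU compute for the tail experts, and all cost values and names are hypothetical.

```python
def gpu_queue_for_split(cpu_cost, gpu_cost, transfer_cost, split, alpha=0.0):
    """Apply the rule 'if T_G_all < T_C_all, send the tail to the GPU queue'.

    Experts [0, split) are costed on the CPU. Experts [split, n) are costed
    on the GPU path: serial transfer plus GPU compute, delayed by alpha.
    Returns the expert indices sent to the GPU queue, or [] if the rule fails.
    """
    t_c_all = sum(cpu_cost[:split])
    t_g_all = alpha + sum(
        t + g for t, g in zip(transfer_cost[split:], gpu_cost[split:])
    )
    if t_g_all < t_c_all:
        return list(range(split, len(cpu_cost)))
    return []


# Hypothetical per-expert costs in arbitrary time units.
cpu_cost = [3, 3, 3, 3]
gpu_cost = [1, 1, 1, 1]
transfer_cost = [2, 2, 2, 2]

print(gpu_queue_for_split(cpu_cost, gpu_cost, transfer_cost, split=3, alpha=1))  # [3]
print(gpu_queue_for_split(cpu_cost, gpu_cost, transfer_cost, split=1, alpha=1))  # []
```

In the first call the CPU side costs 9 and the GPU side costs 4, so the last expert goes to the GPU queue; in the second the GPU side (10) exceeds the CPU side (3), so nothing is queued.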
Figure 12
Figure 12: Throughput comparison between PreScope and baseline systems across models and hardware.
Figure 13
Figure 13: Characterization of expert computation and transfer costs across models and heterogeneous devices. … same hot-expert table at initialisation, but their parameter-compression mechanisms remain disabled. Metrics. We measure throughput (tokens generated per unit of generation time) and decoding latency (time per output token). Generation time covers both the prefill and the decoding stages. Dataset. Experime…
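The two metrics defined in this excerpt translate directly into code. The sketch follows the stated definitions (generation time covers prefill plus decoding; decoding latency is time per output token); the function names and sample numbers are illustrative only.

```python
def throughput(tokens_generated, prefill_s, decode_s):
    """Tokens generated per second of generation time (prefill + decoding)."""
    return tokens_generated / (prefill_s + decode_s)


def decoding_latency(decode_s, output_tokens):
    """Average decoding time per output token, in seconds."""
    return decode_s / output_tokens


# Hypothetical run: 128 output tokens, 2 s of prefill, 6 s of decoding.
print(throughput(128, prefill_s=2.0, decode_s=6.0))      # 16.0
print(decoding_latency(decode_s=6.0, output_tokens=128))  # 0.046875
```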
Figure 14
Figure 14: Decoding latency comparison between PreScope and baseline systems across models and hardware. … CPU-GPU collaborative methods, it surpasses HybriMoE by 58.7%, 37.6%, and 55.1%, and outperforms Fiddler by 71.0%, 59.6%, and 97.4%. This advantage stems from PreScope’s accurate expert prefetching mechanism and efficient scheduling optimization, which enable a higher degree of GPU-CPU parallelism. In contrast, …
Figure 16
Figure 16: Prediction accuracy of LLaPor across different models and datasets.
Figure 17
Figure 17: Per-layer time breakdown comparison between PreScope and collaborative inference baseline methods. … from affecting subsequent scheduling. Compared to state-of-the-art gate-based prediction methods, LLaPor improves accuracy by 15%–68.4%. We further examine the impact of LLaPor on end-to-end operation …
Original abstract

Mixture-of-Experts (MoE) models face memory and PCIe latency bottlenecks when deployed on commodity hardware. Offloading expert weights to CPU memory results in PCIe transfer latency that exceeds GPU computation by several folds. We present PreScope, a prediction-driven expert scheduling system that addresses three key challenges: inaccurate activation prediction, PCIe bandwidth competition, and cross-device scheduling complexity. Our solution includes: 1) Learnable Layer-Aware Predictor (LLaPor) that captures layer-specific expert activation patterns; 2) Prefetch-Aware Cross-Layer Scheduling (PreSched) that generates globally optimal plans balancing prefetching costs and loading overhead; 3) Asynchronous I/O Optimizer (AsyncIO) that decouples I/O from computation, eliminating waiting bubbles. PreScope achieves 141% higher throughput and 74.6% lower latency than state-of-the-art solutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PreScope (titled LayerScope), a prediction-driven scheduling system for Mixture-of-Experts inference on legacy servers. It offloads expert weights to CPU memory and mitigates PCIe latency via three components: the Learnable Layer-Aware Predictor (LLaPor) for layer-specific activation forecasting, Prefetch-Aware Cross-Layer Scheduling (PreSched) for globally optimal prefetch plans, and Asynchronous I/O Optimizer (AsyncIO) to eliminate I/O-compute bubbles. The central claim is a 141% throughput increase and 74.6% latency reduction versus state-of-the-art baselines.

Significance. If the performance numbers are shown to be robust and the predictive component is isolated as the source of gains, the work could meaningfully improve practical MoE deployment on commodity hardware by reducing PCIe pressure without requiring high-end interconnects.

major comments (2)
  1. [Evaluation] Evaluation section: the reported 141% throughput and 74.6% latency gains are presented only as end-to-end results; no ablation replaces LLaPor predictions with static or oracle-free scheduling while retaining PreSched and AsyncIO. Without this comparison it is impossible to confirm that the learned layer-specific forecasts, rather than AsyncIO alone, produce the claimed net PCIe savings after misprediction overhead.
  2. [Abstract] Abstract and results: large performance deltas are stated without describing the experimental setup (models, batch sizes, hardware, baseline implementations, or error bars), preventing assessment of whether the data support the headline claims.
minor comments (2)
  1. Resolve the naming inconsistency between the title (LayerScope) and the system name used throughout the text (PreScope).
  2. Add explicit references to any released code, models, or datasets to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These have helped us identify areas where the manuscript can be strengthened, particularly in evaluation rigor and experimental clarity. We provide point-by-point responses below and have made revisions to address the concerns.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the reported 141% throughput and 74.6% latency gains are presented only as end-to-end results; no ablation replaces LLaPor predictions with static or oracle-free scheduling while retaining PreSched and AsyncIO. Without this comparison it is impossible to confirm that the learned layer-specific forecasts, rather than AsyncIO alone, produce the claimed net PCIe savings after misprediction overhead.

    Authors: We agree that an ablation isolating the contribution of LLaPor is essential to substantiate that the gains arise from the layer-aware predictions rather than AsyncIO in isolation. In the revised manuscript we have added a dedicated ablation subsection in the Evaluation section. This compares the full PreScope system against a variant that substitutes LLaPor with a static (historical-average) scheduler while retaining PreSched and AsyncIO. The new results confirm that the predictive component delivers additional PCIe savings after misprediction overhead is accounted for, thereby strengthening the causal link between LLaPor and the reported end-to-end improvements. revision: yes

  2. Referee: [Abstract] Abstract and results: large performance deltas are stated without describing the experimental setup (models, batch sizes, hardware, baseline implementations, or error bars), preventing assessment of whether the data support the headline claims.

    Authors: We acknowledge the need for greater transparency in the abstract and results presentation. We have revised the abstract to concisely describe the evaluated models, batch-size range, legacy-server hardware configuration, baseline systems, and the reporting of error bars. Corresponding details and error-bar annotations have also been added to the results section and figures. These changes allow readers to directly assess the support for the headline performance numbers. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The paper introduces PreScope as a system combining LLaPor for layer-specific expert activation prediction, PreSched for cross-layer prefetch scheduling, and AsyncIO for asynchronous I/O optimization, with performance claims (141% throughput, 74.6% latency improvement) presented as outcomes of empirical evaluation on MoE models. No equations, self-citations, or derivations are exhibited in the provided text that reduce any prediction, uniqueness claim, or result to a fitted input or prior self-referential definition by construction. The central claims rest on measured end-to-end gains rather than tautological redefinitions or load-bearing self-citations, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations, no fitted constants, and no explicit assumptions or new entities; full text would be required to populate this ledger.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

    cs.LG 2026-04 unverdicted novelty 6.0

    NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...

Reference graph

Works this paper leans on


  1. [1]

    Anonymous. 2024. ShareGPT-V3-unfiltered-cleaned-split. Electronic dataset. https://huggingface.co/datasets/learnanything/sharegpt_v3_unfiltered_cleaned_split

  2. [2]

    Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. 2024. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In ACM ASPLOS. 929–947

  3. [3]

    Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E Gonzalez, Matei Zaharia, and Ion Stoica. 2025. Moe-lightning: High-throughput moe inference on memory-constrained gpus. In ACM ASPLOS. 715–730

  4. [4]

    Hongtao Chen, Weiyu Xie, Boxin Zhang, Jingqi Tang, Jiahao Wang, Jianwei Dong, Shaoyuan Chen, Ziwei Yuan, Chen Lin, Chengyu Qiu, Yuening Zhu, Qingliang Ou, Jiaqi Liao, Xianglin Chen, Zhiyuan Ai, Yongwei Wu, and Mingxing Zhang. 2025. KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models. In ACM SOSP. 10–26

  5. [5]

    Le Chen, Dahu Feng, Erhu Feng, Rong Zhao, Yingrui Wang, Yubin Xia, Haibo Chen, and Pinjie Xu. 2025. HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs platform with Heterogeneous AI Accelerators. arXiv:2501.14794

  6. [6]

    Peizhuang Cong, Aomufei Yuan, Shimao Chen, Yuxuan Tian, Bowen Ye, and Tong Yang. 2024. Prediction is all moe needs: Expert load distribution goes from fluctuating to stabilizing. arXiv:2404.16914

  7. [7]

    Hongchao Du, Shangyu Wu, Arina Kharlamova, Nan Guan, and Chun Jason Xue. 2025. FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference. In EuroMLSys. 56–65

  8. [8]

    Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai Li, and Yiran Chen. 2024. Sida: Sparsity-inspired data-aware serving for efficient and scalable large mixture-of-experts models. MLSys 6 (2024), 224–238

  9. [9]

    Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, and Dahua Lin. 2025. MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design. arXiv:2505.05799

  10. [10]

    Zhiyuan Fang, Zicong Hong, Yuegui Huang, et al. 2025. Fate: Fast Edge Inference of Mixture-of-Experts Models via Cross-Layer Gate. arXiv:2502.12224

  11. [11]

    Zhiyuan Fang, Yuegui Huang, Zicong Hong, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, and Zibin Zheng. 2025. Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline. In ACM ASPLOS. 574–588

  12. [12]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR 23, 120 (2022), 1–39

  13. [13]

    Elias Frantar and Dan Alistarh. 2023. Qmoe: Practical sub-1-bit compression of trillion-parameter models. arXiv:2310.16795

  14. [14]

    Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Zhaopeng Tu, and Tao Lin. 2024. Dynamic mixture of experts: An auto-tuning approach for efficient transformer models. arXiv:2405.14297

  15. [15]

    Vima Gupta, Kartik Sinha, Ada Gavrilovska, and Anand Padmanabha Iyer. 2024. Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection. arXiv:2411.08982

  16. [16]

    Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, and Ong Yew Soon. 2024. Expertflow: Optimized expert activation and token allocation for efficient mixture-of-experts inference. arXiv:2410.17954

  17. [17]

    Huanqi Hu, Bowen Xiao, Shixuan Sun, Jianian Yin, Zhexi Zhang, Xiang Luo, Chengquan Jiang, Weiqi Xu, Xiaoying Jia, Xin Liu, et al. LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving. arXiv:2509.01229

  18. [18]

    Haiyang Huang, Newsha Ardalani, Anna Sun, Liu Ke, Shruti Bhosale, Hsien-Hsin Lee, Carole-Jean Wu, and Benjamin Lee. 2024. Toward efficient inference for mixture of experts. NIPS 37 (2024), 84033–84059

  19. [19]

    Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, and Mao Yang. 2024. Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference. In IEEE ISCA. 1018–1031

  20. [20]

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv:2401.04088

  21. [21]

    Zewen Jin, Shengnan Wang, Jiaan Zhu, Hongrui Zhan, Youhui Bai, Lin Zhang, Zhenyu Ming, and Cheng Li. 2025. BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference. In AAAI. 17689–17698

  22. [22]

    Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, and Baris Kasikci. Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models. In ICLR. 56099–56115

  23. [23]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In ACM SOSP. 611–626

  24. [24]

    Xinlu Lai. 2024. The DPO Dataset for Chinese and English with emoji. https://huggingface.co/datasets/shareAI/DPO-zh-en-emoji

  25. [25]

    Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In OSDI. 155–172

  26. [26]

    Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. 2023. Accelerating distributed MoE training and inference with lina. In USENIX ATC 23. 945–959

  27. [27]

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv:2405.04434

  28. [28]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. NIPS 36 (2023), 34892–34916

  29. [29]

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, et al. 2025. Muon is Scalable for LLM Training. arXiv:2502.16982

  30. [30]

    Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, and Hongsheng Li. 2024. Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models. arXiv:2402.14800

  31. [31]

    Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, et al. 2025. LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference. In ISCA. 514–528

  32. [32]

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, et al. 2024. GPT-4 Technical Report. arXiv:2303.08774

  33. [33]

    Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, and Jie Zhang. 2025. InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference. In IEEE HPCA. 1510–1525

  34. [34]

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In ICML. 18332–18346

  35. [35]

    Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In ACM SC. 1–14

  36. [36]

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. NIPS 37 (2024), 68658–68685

  37. [37]

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In ICML. 31094–31116

  38. [38]

    Xiaoniu Song, Zihang Zhong, Rong Chen, and Haibo Chen. 2024. Promoe: Fast moe-based llm serving using proactive caching. arXiv:2410.22134

  39. [39]

    Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2024. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. In ACM SOSP. 590–606

  40. [40]

    Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, and Max Ryabinin. 2024. Specexec: Massively parallel speculative decoding for interactive llm inference on consumer devices. NIPS 37 (2024), 16342–16368

  41. [41]

    Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, and Minyi Guo. 2024. Hobbit: A mixed precision expert offloading system for fast moe inference. arXiv:2411.01433

  42. [42]

    Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, and Jianzong Wang. 2025. MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts. arXiv:2506.07533

  43. [43]

    Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

  44. [44]

    Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. 2023. Tabi: An efficient multi-level inference system for large language models. In EuroSys. 233–248

  45. [45]

    Yuanxin Wei, Jiangsu Du, Jiazhi Jiang, Xiao Shi, Xianwei Zhang, Dan Huang, Nong Xiao, and Yutong Lu. 2024. APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes. In IEEE SC. 1–14

  46. [46]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv:1910.03771

  47. [47]

    Daliang Xu, Wangsong Yin, Hao Zhang, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu. 2025. EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding. IEEE TMC 24, 4 (2025), 3256–3273

  48. [48]

    Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, and Luo Mai. 2025. MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching. arXiv:2503.09716

  49. [49]

    Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You. 2024. OpenMoE: an early effort on open mixture-of-experts language models. In ICML. 55625–55655

  50. [50]

    Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. 2025. MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache. arXiv:2401.14361

  51. [51]

    Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, and Dhabaleswar K DK Panda. 2024. Exploiting inter-layer expert affinity for accelerating mixture-of-experts model inference. In IEEE IPDPS. 915–925

  52. [52]

    Hanfei Yu, Xingqi Cui, Hong Zhang, and Hao Wang. 2025. fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving. arXiv:2502.05370

  53. [53]

    Libo Zhang, Zhaoning Zhang, Baizhou Xu, Songzhu Mei, and Dongsheng Li. 2025. Dovetail: A cpu/gpu heterogeneous speculative decoding for llm inference. In EMNLP. 1–13

  54. [54]

    Yujie Zhang, Shivam Aggarwal, and Tulika Mitra. 2025. DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference. In IEEE DATE. 1–7

  55. [55]

    Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, and Yang You. 2024. Hetegen: Efficient heterogeneous parallel inference for large language models on resource-constrained devices. MLSys 6 (2024), 162–172

  56. [56]

    Shuzhang Zhong, Ling Liang, Yuan Wang, Runsheng Wang, Ru Huang, and Meng Li. 2024. AdapMoE: Adaptive sensitivity-based expert gating and management for efficient moe inference. In IEEE ICCAD. 1–9

  57. [57]

    Shuzhang Zhong, Yanfan Sun, Ling Liang, Runsheng Wang, Ru Huang, and Meng Li. 2025. HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference. In DAC. 1–7