LayerScope: Predictive Cross-Layer Scheduling for Efficient Multi-Batch MoE Inference on Legacy Servers

Dezun Dong; Dongsheng Li; Enda Yu; Haojie Wang; Weiling Yang; Xiangke Liao; Yongwei Wu; Zhaoning Zhang; Zhe Bai

arxiv: 2509.23638 · v2 · submitted 2025-09-28 · 💻 cs.LG

Pith reviewed 2026-05-18 12:34 UTC · model grok-4.3

keywords: Mixture-of-Experts · inference optimization · expert scheduling · PCIe offloading · predictive prefetching · cross-layer scheduling · asynchronous I/O

The pith

PreScope uses a learnable layer-aware predictor and cross-layer scheduling to deliver 141% higher throughput for MoE models on commodity servers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PreScope as a system to run Mixture-of-Experts models efficiently on standard servers where GPU memory is limited and expert weights must be offloaded to CPU. This offloading creates PCIe transfer delays that far exceed computation time, so the work focuses on predicting which experts will activate in each layer to enable timely prefetching. It combines the Learnable Layer-Aware Predictor to model layer-specific patterns, a global scheduler that balances prefetch costs against loading overhead, and an asynchronous I/O mechanism that hides data movement behind ongoing computation. If these pieces work together, multi-batch inference avoids most idle time and achieves large gains over prior methods.

Core claim

PreScope is a prediction-driven expert scheduling system that addresses inaccurate activation prediction, PCIe bandwidth competition, and cross-device scheduling complexity through three components: the Learnable Layer-Aware Predictor (LLaPor) that captures layer-specific expert activation patterns, Prefetch-Aware Cross-Layer Scheduling (PreSched) that generates globally optimal plans, and Asynchronous I/O Optimizer (AsyncIO) that decouples I/O from computation, yielding 141% higher throughput and 74.6% lower latency than state-of-the-art solutions.

What carries the argument

The Learnable Layer-Aware Predictor (LLaPor) that captures layer-specific expert activation patterns to drive prefetch and scheduling decisions.

If this is right

Large MoE models become practical to serve on servers that lack high-capacity GPU memory by moving most weights to CPU and moving only needed experts over PCIe.
Globally optimal prefetch plans across layers reduce the total data movement cost compared with per-layer greedy decisions.
Decoupling I/O from computation through asynchronous operations removes waiting bubbles and raises overall GPU utilization during inference.
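
The effect of decoupling I/O from computation can be illustrated with a toy timing model. The sketch below is not taken from the paper: it assumes a fixed compute time and a fixed expert-transfer time per layer, a single copy engine that issues transfers back to back, and enough GPU memory to buffer prefetched experts. The function names and the numbers are hypothetical.

```python
def synchronous_time(compute, transfer):
    """Each layer first loads its experts, then computes (no overlap)."""
    return sum(c + t for c, t in zip(compute, transfer))


def overlapped_time(compute, transfer):
    """Idealized asynchronous schedule.

    Transfers run one after another on the link, independently of compute.
    A layer starts once the previous layer has finished computing AND its
    own experts have arrived.
    """
    io_done = 0.0       # time at which the current layer's experts have arrived
    compute_done = 0.0  # time at which the GPU becomes free
    for c, t in zip(compute, transfer):
        io_done += t
        start = max(compute_done, io_done)
        compute_done = start + c
    return compute_done


# Hypothetical per-layer costs in milliseconds.
compute = [2, 2, 2]
transfer = [3, 3, 3]
print(synchronous_time(compute, transfer))  # 15
print(overlapped_time(compute, transfer))   # 11.0
```

In this toy setting the transfers still dominate (the link is busy for 9 ms), so overlap hides only the compute of the earlier layers. When compute time exceeds transfer time, the same model hides every transfer except the first.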

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prediction-plus-prefetch pattern could be tested on other memory-bound transformer variants that use conditional computation.
On hardware with faster interconnects the relative benefit of accurate prediction might shrink, which could be checked by repeating the measurements on newer servers.
Activation patterns may shift after continued fine-tuning, so periodic retraining of the predictor would likely be needed to maintain the reported gains.

Load-bearing premise

The Learnable Layer-Aware Predictor can capture layer-specific expert activation patterns with enough accuracy that the resulting prefetch and scheduling decisions produce net gains rather than added overhead.

What would settle it

Run the same multi-batch MoE workload with the LLaPor predictor replaced by random or static expert selection and measure whether the reported throughput and latency gains disappear.
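
A minimal sketch of the bookkeeping such an ablation would need, assuming per-layer traces of activated and prefetched expert ids are available. The traces, the static hot-expert list, and the function name are hypothetical; the paper's own instrumentation is not described in the text above.

```python
def prefetch_outcome(actual, predicted):
    """Count prefetch hits, on-demand loads (misses) and wasted prefetches.

    `actual` and `predicted` are sequences of expert-id collections,
    one entry per layer.
    """
    hits = misses = wasted = 0
    for act, pred in zip(actual, predicted):
        act, pred = set(act), set(pred)
        hits += len(act & pred)     # prefetched and used
        misses += len(act - pred)   # used but must be loaded on demand
        wasted += len(pred - act)   # transferred but never used
    return {"hits": hits, "misses": misses, "wasted": wasted}


# Hypothetical three-layer trace with two active experts per layer.
actual = [{0, 3}, {1, 2}, {2, 5}]
learned = [{0, 3}, {1, 4}, {2, 5}]   # stand-in for a learned predictor
static = [{0, 1}] * 3                # stand-in for a fixed hot-expert list

print(prefetch_outcome(actual, learned))  # {'hits': 5, 'misses': 1, 'wasted': 1}
print(prefetch_outcome(actual, static))   # {'hits': 2, 'misses': 4, 'wasted': 4}
```

Misses translate into on-demand PCIe loads and wasted prefetches into extra bandwidth pressure, so both counts matter when judging whether a predictor yields net gains.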

Figures

Figures reproduced from arXiv: 2509.23638 by Dezun Dong, Dongsheng Li, Enda Yu, Haojie Wang, Weiling Yang, Xiangke Liao, Yongwei Wu, Zhaoning Zhang, Zhe Bai.

**Figure 1.** Impact of prefetching on inference latency.

Adjacent text (1 Introduction): Mixture of Experts (MoE) [20, 27, 29, 43] models enhance computational efficiency in large language models via sparse activation, yet face a critical memory bottleneck in resource-constrained environments [21, 23, 34]. For example, loading all experts of Mixtral-8x7B [20] requires 80GB memory, far exceeding the 32GB capacity of NVIDIA V100 GPU. Exp…

**Figure 3.** Architecture and inference process of MoE models.

**Figure 5.** Characterization of expert computation and transfer costs across heterogeneous devices.

Adjacent text: On-demand loading methods, such as Lina [26] and ExpertFlow [16], exclusively utilize GPUs for expert computation. If an expert is not prefetched to the GPU, it must be loaded on demand, as shown in …

**Figure 6.** Comparison of three existing expert activation prediction strategies.

Adjacent text: … concurrently costs no less than transferring them serially, so expert traffic can be modelled as a sequential process. This observation forms the basis of our inference-cost model. 2.4 MoE Prefetching. In MoE inference, prefetching hot experts for subsequent layers increases I/O–computation overlap [10, 52, 57]. Prior methods exploit dist…
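
The observation that expert traffic can be modelled as a sequential process suggests a very simple transfer-cost model. The sketch below assumes a single shared link with constant effective bandwidth and no per-transfer setup cost. The expert sizes and the bandwidth figure are hypothetical, and the paper's actual cost model is not reproduced in the excerpt.

```python
def serial_transfer_finish_times(sizes_gb, bandwidth_gb_s):
    """Finish time (seconds) of each expert when transfers run in order on one link."""
    finish = []
    clock = 0.0
    for size in sizes_gb:
        clock += size / bandwidth_gb_s  # each transfer waits for all earlier ones
        finish.append(clock)
    return finish


# Hypothetical: three experts of 1, 2 and 1 GB over an 8 GB/s link.
print(serial_transfer_finish_times([1.0, 2.0, 1.0], 8.0))  # [0.125, 0.375, 0.5]
```

Under this model the order of the queue determines which expert is ready first, which is why a scheduler has to decide what to prefetch and when, not only how much.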

**Figure 9.** Architectural comparison between the standard gating network and LLaPor.

Adjacent text (4.2 Learnable Layer-aware Predictor, 4.2.1 Network Architecture of LLaPor): Building on the analysis in Section 2.1, we collect the hidden state $a$, the indices of the activated experts $\mathit{expert}_i$, and their corresponding weights $w_i$ per layer as feature variables during the offline phase. At training time, we use the previous-layer featur…
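
The excerpt is cut off before the LLaPor architecture is described, so the following is only a stand-in that shows the general shape of a per-layer expert predictor: score every expert from a feature vector, squash the scores to probabilities, and prefetch the top-k. The single linear layer, the weights, and the function name are assumptions for illustration, not the paper's network.

```python
import math


def predict_experts(hidden, weights, bias, k):
    """Return (top-k expert ids, per-expert probabilities).

    `weights` holds one row per expert; each row is dotted with the
    hidden-state features, then passed through a sigmoid.
    """
    probs = []
    for w_row, b in zip(weights, bias):
        logit = sum(h * w for h, w in zip(hidden, w_row)) + b
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    ranked = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)
    return ranked[:k], probs


# Hypothetical 2-dimensional hidden state and 4 experts.
hidden = [1.0, -1.0]
weights = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [-1.0, -2.0]]
bias = [0.0, 0.0, 0.0, 0.0]
top, probs = predict_experts(hidden, weights, bias, k=2)
print(top)  # [0, 3]
```

A per-layer predictor in this style would keep one set of weights per layer, which is one plausible reading of "layer-aware".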

**Figure 10.** Comparison of PreSched scheduling versus layer-by-layer scheduling strategies.

Adjacent text: … total number of experts, and BCE refers to the binary cross-entropy loss $\mathrm{BCE}(y_i, p_i) = y_i \log(p_i) + (1 - y_i)\log(1 - p_i)$. The focal loss term emphasizes hard misclassified examples by reducing the contribution of easy samples through a modulating factor $(1 - p_t)^{\gamma}$, where $p_t$ is the predicted probability for the true label and …
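
The two loss terms named in the excerpt can be written out for a single expert label. The sketch uses the conventional negated form of binary cross-entropy (the excerpt prints the expression without a leading minus sign) and the standard focal-loss definition. The default `gamma=2.0` is an assumption, since the excerpt is truncated before the value and the weighting between the two terms are stated.

```python
import math


def bce(y, p):
    """Binary cross-entropy for a label y in {0, 1} and probability 0 < p < 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def focal(y, p, gamma=2.0):
    """Focal loss: cross-entropy scaled by the modulating factor (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1 - p          # probability assigned to the true label
    return (1 - p_t) ** gamma * -math.log(p_t)


# An easy example (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.1).
print(focal(1, 0.9) / bce(1, 0.9))  # about 0.01
print(focal(1, 0.1) / bce(1, 0.1))  # about 0.81
```

With `gamma=0` the modulating factor is 1 and the focal term reduces to plain cross-entropy.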

**Figure 11.** Mathematical modeling of PreSched balancing latency benefit.

Adjacent text: Here, $\alpha$ is the delay in starting the on-demand operation due to previous prefetching, and $t_c(E_{all}[0 \ldots i])$ is the total CPU computational cost of experts $E_{all}[0 \ldots i]$, which can be calculated using Equation 3. If $T^{G}_{all} < T^{C}_{all}$, add $E_{all}[i \ldots n+n']$ to the GPU Queue. Since the current and predicted layers may have misaligned lo…
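
The excerpt describes comparing a GPU-side cost against a CPU-side cost to decide which experts join the GPU queue, but the cost equations themselves are not reproduced. The sketch below is therefore a simplified stand-in under assumed formulas: experts before the split index run on the CPU, experts from the split index onward are transferred and run on the GPU after a start delay `alpha`, and the split that minimizes the slower of the two sides is chosen. All names and cost values are hypothetical.

```python
def choose_split(cpu_cost, gpu_cost, transfer_cost, alpha=0.0):
    """Return (i, makespan): experts[:i] run on the CPU, experts[i:] go to the GPU queue."""
    n = len(cpu_cost)
    best = None
    for i in range(n + 1):
        t_cpu = sum(cpu_cost[:i])
        if i < n:
            # Assumed GPU-side cost: start delay + serial transfers + GPU compute.
            t_gpu = alpha + sum(transfer_cost[i:]) + sum(gpu_cost[i:])
        else:
            t_gpu = 0.0  # nothing is sent to the GPU
        makespan = max(t_cpu, t_gpu)  # CPU and GPU sides run in parallel
        if best is None or makespan < best[1]:
            best = (i, makespan)
    return best


# Hypothetical costs (ms) for four experts.
print(choose_split([4, 4, 4, 4], [1, 1, 1, 1], [3, 3, 3, 3], alpha=1))  # (2, 9)
```

In this example a pure-GPU plan costs 17 ms and a pure-CPU plan 16 ms, while splitting the experts evenly brings the slower side down to 9 ms.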

**Figure 12.** Throughput comparison between PreScope and baseline systems across models and hardware.

**Figure 13.** Characterization of expert computation and transfer costs across models and heterogeneous devices.

Adjacent text: … same hot-expert table at initialisation, but their parameter-compression mechanisms remain disabled. Metrics. We measure throughput (tokens generated per unit of generation time) and decoding latency (time per output token). Generation time covers both the prefill and the decoding stages. Dataset. Experime…
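
The two metrics defined in the excerpt can be stated as short functions. The timing values in the example are hypothetical and only show how the definitions are applied.

```python
def throughput(tokens_generated, prefill_s, decode_s):
    """Tokens per second, where generation time covers prefill plus decoding."""
    return tokens_generated / (prefill_s + decode_s)


def decoding_latency(decode_s, output_tokens):
    """Average time per output token during the decoding stage, in seconds."""
    return decode_s / output_tokens


# Hypothetical run: 512 output tokens, 2 s of prefill, 14 s of decoding.
print(throughput(512, 2.0, 14.0))     # 32.0
print(decoding_latency(14.0, 512))    # 0.02734375
```

Because throughput includes prefill time while decoding latency does not, the two numbers can move independently when prompt length changes.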

**Figure 14.** Decoding latency comparison between PreScope and baseline systems across models and hardware.

Adjacent text: … CPU-GPU collaborative methods, it surpasses HybriMoE by 58.7%, 37.6%, and 55.1%, and outperforms Fiddler by 71.0%, 59.6%, and 97.4%. This advantage stems from PreScope’s accurate expert prefetching mechanism and efficient scheduling optimization, which enable a higher degree of GPU-CPU parallelism. In contrast, …

**Figure 16.** Prediction accuracy of LLaPor across different models and datasets.

**Figure 17.** Per-layer time breakdown comparison between PreScope and collaborative inference baseline methods.

Adjacent text: … from affecting subsequent scheduling. Compared to state-of-the-art gate-based prediction methods, LLaPor improves accuracy by 15%–68.4%. We further examine the impact of LLaPor on end-to-end operation …

Original abstract

Mixture-of-Experts (MoE) models face memory and PCIe latency bottlenecks when deployed on commodity hardware. Offloading expert weights to CPU memory results in PCIe transfer latency that exceeds GPU computation by several folds. We present PreScope, a prediction-driven expert scheduling system that addresses three key challenges: inaccurate activation prediction, PCIe bandwidth competition, and cross-device scheduling complexity. Our solution includes: 1) Learnable Layer-Aware Predictor (LLaPor) that captures layer-specific expert activation patterns; 2) Prefetch-Aware Cross-Layer Scheduling (PreSched) that generates globally optimal plans balancing prefetching costs and loading overhead; 3) Asynchronous I/O Optimizer (AsyncIO) that decouples I/O from computation, eliminating waiting bubbles. PreScope achieves 141% higher throughput and 74.6% lower latency than state-of-the-art solutions.
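
The headline percentages are easier to compare once converted to ratios against the baseline. The helper names below are illustrative, and the conversion is plain arithmetic rather than anything specific to the paper.

```python
def ratio_from_gain(percent_higher):
    """A metric that is X% higher equals (1 + X/100) times the baseline."""
    return 1 + percent_higher / 100


def ratio_from_reduction(percent_lower):
    """A metric that is X% lower equals (1 - X/100) times the baseline."""
    return 1 - percent_lower / 100


throughput_ratio = ratio_from_gain(141)        # about 2.41x the baseline throughput
latency_ratio = ratio_from_reduction(74.6)     # about 0.254x the baseline latency
print(round(throughput_ratio, 3), round(latency_ratio, 3), round(1 / latency_ratio, 2))
```

So 141% higher throughput means roughly 2.4 times the baseline, and 74.6% lower latency means the latency shrinks to about a quarter of the baseline, a factor of roughly 3.9.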

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PreScope combines a layer-aware predictor with cross-layer prefetch scheduling and async IO for MoE offloading, but the large gains rest on an unablated assumption that the predictions deliver net savings.

This paper presents PreScope, a system for expert offloading in MoE models on commodity servers. It uses a Learnable Layer-Aware Predictor to forecast per-layer activations, feeds those forecasts into Prefetch-Aware Cross-Layer Scheduling to build global plans, and adds Asynchronous I/O to overlap transfers with compute. The claimed result is fewer PCIe stalls, which otherwise dominate runtime when expert weights live in CPU memory. The approach is a direct engineering response to a deployment constraint that many groups face when they cannot buy new accelerators. The breakdown into three challenges and three component solutions is clear and practical, and the work earns credit for targeting legacy hardware rather than assuming ideal GPU clusters.

The soft spot is exactly the one the stress-test flags. The 141% throughput and 74.6% latency numbers require LLaPor forecasts accurate enough that PreSched decisions produce real PCIe savings after misprediction overhead and extra scheduler work. No ablation that disables the predictor while retaining PreSched and AsyncIO is described in the abstract, and the full text does not appear to supply one either. Without that isolation, it is hard to know whether the predictor is the driver or whether the scheduling and async pieces would have delivered most of the improvement on their own. The experimental setup details are also thin in the provided summary, which makes reproducibility harder to judge.

This paper is for systems researchers and engineers who optimize inference for sparse models on non-ideal hardware. A practitioner looking for scheduling patterns to adapt would get usable ideas. It deserves a serious referee because the problem is timely and the system description is concrete enough to review, even though revisions will almost certainly be needed for stronger evidence on the predictor. I would send it out with a request for the missing ablation.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PreScope (titled LayerScope), a prediction-driven scheduling system for Mixture-of-Experts inference on legacy servers. It offloads expert weights to CPU memory and mitigates PCIe latency via three components: the Learnable Layer-Aware Predictor (LLaPor) for layer-specific activation forecasting, Prefetch-Aware Cross-Layer Scheduling (PreSched) for globally optimal prefetch plans, and Asynchronous I/O Optimizer (AsyncIO) to eliminate I/O-compute bubbles. The central claim is a 141% throughput increase and 74.6% latency reduction versus state-of-the-art baselines.

Significance. If the performance numbers are shown to be robust and the predictive component is isolated as the source of gains, the work could meaningfully improve practical MoE deployment on commodity hardware by reducing PCIe pressure without requiring high-end interconnects.

major comments (2)

[Evaluation] Evaluation section: the reported 141% throughput and 74.6% latency gains are presented only as end-to-end results; no ablation replaces LLaPor predictions with static or oracle-free scheduling while retaining PreSched and AsyncIO. Without this comparison it is impossible to confirm that the learned layer-specific forecasts, rather than AsyncIO alone, produce the claimed net PCIe savings after misprediction overhead.
[Abstract] Abstract and results: large performance deltas are stated without describing the experimental setup (models, batch sizes, hardware, baseline implementations, or error bars), preventing assessment of whether the data support the headline claims.

minor comments (2)

Resolve the naming inconsistency between the title (LayerScope) and the system name used throughout the text (PreScope).
Add explicit references to any released code, models, or datasets to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These have helped us identify areas where the manuscript can be strengthened, particularly in evaluation rigor and experimental clarity. We provide point-by-point responses below and have made revisions to address the concerns.

Referee: [Evaluation] Evaluation section: the reported 141% throughput and 74.6% latency gains are presented only as end-to-end results; no ablation replaces LLaPor predictions with static or oracle-free scheduling while retaining PreSched and AsyncIO. Without this comparison it is impossible to confirm that the learned layer-specific forecasts, rather than AsyncIO alone, produce the claimed net PCIe savings after misprediction overhead.

Authors: We agree that an ablation isolating the contribution of LLaPor is essential to substantiate that the gains arise from the layer-aware predictions rather than AsyncIO in isolation. In the revised manuscript we have added a dedicated ablation subsection in the Evaluation section. This compares the full PreScope system against a variant that substitutes LLaPor with a static (historical-average) scheduler while retaining PreSched and AsyncIO. The new results confirm that the predictive component delivers additional PCIe savings after misprediction overhead is accounted for, thereby strengthening the causal link between LLaPor and the reported end-to-end improvements. revision: yes
Referee: [Abstract] Abstract and results: large performance deltas are stated without describing the experimental setup (models, batch sizes, hardware, baseline implementations, or error bars), preventing assessment of whether the data support the headline claims.

Authors: We acknowledge the need for greater transparency in the abstract and results presentation. We have revised the abstract to concisely describe the evaluated models, batch-size range, legacy-server hardware configuration, baseline systems, and the reporting of error bars. Corresponding details and error-bar annotations have also been added to the results section and figures. These changes allow readers to directly assess the support for the headline performance numbers. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper introduces PreScope as a system combining LLaPor for layer-specific expert activation prediction, PreSched for cross-layer prefetch scheduling, and AsyncIO for asynchronous I/O optimization, with performance claims (141% throughput, 74.6% latency improvement) presented as outcomes of empirical evaluation on MoE models. No equations, self-citations, or derivations are exhibited in the provided text that reduce any prediction, uniqueness claim, or result to a fitted input or prior self-referential definition by construction. The central claims rest on measured end-to-end gains rather than tautological redefinitions or load-bearing self-citations, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations, no fitted constants, and no explicit assumptions or new entities; full text would be required to populate this ledger.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
cs.LG 2026-04 unverdicted novelty 6.0

NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1] Anonymous. 2024. ShareGPT-V3-unfiltered-cleaned-split. Electronic dataset. https://huggingface.co/datasets/learnanything/sharegpt_v3_unfiltered_cleaned_split
[2] Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. 2024. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In ACM ASPLOS. 929–947
[3] Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E Gonzalez, Matei Zaharia, and Ion Stoica. 2025. Moe-lightning: High-throughput moe inference on memory-constrained gpus. In ACM ASPLOS. 715–730
[4] Hongtao Chen, Weiyu Xie, Boxin Zhang, Jingqi Tang, Jiahao Wang, Jianwei Dong, Shaoyuan Chen, Ziwei Yuan, Chen Lin, Chengyu Qiu, Yuening Zhu, Qingliang Ou, Jiaqi Liao, Xianglin Chen, Zhiyuan Ai, Yongwei Wu, and Mingxing Zhang. 2025. KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models. In ACM SOSP. 10–26
[5] Le Chen, Dahu Feng, Erhu Feng, Rong Zhao, Yingrui Wang, Yubin Xia, Haibo Chen, and Pinjie Xu. 2025. HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs platform with Heterogeneous AI Accelerators. arXiv:2501.14794
[6] Peizhuang Cong, Aomufei Yuan, Shimao Chen, Yuxuan Tian, Bowen Ye, and Tong Yang. 2024. Prediction is all moe needs: Expert load distribution goes from fluctuating to stabilizing. arXiv:2404.16914
[7] Hongchao Du, Shangyu Wu, Arina Kharlamova, Nan Guan, and Chun Jason Xue. 2025. FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference. In EuroMLSys. 56–65
[8] Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai Li, and Yiran Chen. 2024. Sida: Sparsity-inspired data-aware serving for efficient and scalable large mixture-of-experts models. MLSys 6 (2024), 224–238
[9] Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, and Dahua Lin. 2025. MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design. arXiv:2505.05799
[10] Zhiyuan Fang, Zicong Hong, Yuegui Huang, et al. 2025. Fate: Fast Edge Inference of Mixture-of-Experts Models via Cross-Layer Gate. arXiv:2502.12224
[11] Zhiyuan Fang, Yuegui Huang, Zicong Hong, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, and Zibin Zheng. 2025. Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline. In ACM ASPLOS. 574–588
[12] William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR 23, 120 (2022), 1–39
[13] Elias Frantar and Dan Alistarh. 2023. Qmoe: Practical sub-1-bit compression of trillion-parameter models. arXiv:2310.16795
[14] Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Zhaopeng Tu, and Tao Lin. 2024. Dynamic mixture of experts: An auto-tuning approach for efficient transformer models. arXiv:2405.14297
[15] Vima Gupta, Kartik Sinha, Ada Gavrilovska, and Anand Padmanabha Iyer. 2024. Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection. arXiv:2411.08982
[16] Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, and Ong Yew Soon. 2024. Expertflow: Optimized expert activation and token allocation for efficient mixture-of-experts inference. arXiv:2410.17954
[17] Huanqi Hu, Bowen Xiao, Shixuan Sun, Jianian Yin, Zhexi Zhang, Xiang Luo, Chengquan Jiang, Weiqi Xu, Xiaoying Jia, Xin Liu, et al. LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving. arXiv:2509.01229
[18] Haiyang Huang, Newsha Ardalani, Anna Sun, Liu Ke, Shruti Bhosale, Hsien-Hsin Lee, Carole-Jean Wu, and Benjamin Lee. 2024. Toward efficient inference for mixture of experts. NIPS 37 (2024), 84033–84059
[19] Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, and Mao Yang. 2024. Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference. In IEEE ISCA. 1018–1031
[20] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv:2401.04088
[21] Zewen Jin, Shengnan Wang, Jiaan Zhu, Hongrui Zhan, Youhui Bai, Lin Zhang, Zhenyu Ming, and Cheng Li. 2025. BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference. In AAAI. 17689–17698
[22] Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, and Baris Kasikci. Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models. In ICLR. 56099–56115
[23] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In ACM SOSP. 611–626
[24] Xinlu Lai. 2024. The DPO Dataset for Chinese and English with emoji. https://huggingface.co/datasets/shareAI/DPO-zh-en-emoji
[25] Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In OSDI. 155–172
[26] Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. 2023. Accelerating distributed MoE training and inference with lina. In USENIX ATC 23. 945–959
[27] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv:2405.04434
[28] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. NIPS 36 (2023), 34892–34916
[29] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, et al. 2025. Muon is Scalable for LLM Training. arXiv:2502.16982
[30] Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, and Hongsheng Li. 2024. Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models. arXiv:2402.14800
[31] Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, et al. 2025. LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference. In ISCA. 514–528
[32] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, et al. 2024. GPT-4 Technical Report. arXiv:2303.08774
[33] Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, and Jie Zhang. 2025. InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference. In IEEE HPCA. 1510–1525
[34] Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In ICML. 18332–18346
[35] Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In ACM SC. 1–14
[36] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. NIPS 37 (2024), 68658–68685
[37] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In ICML. 31094–31116
[38] Xiaoniu Song, Zihang Zhong, Rong Chen, and Haibo Chen. 2024. Promoe: Fast moe-based llm serving using proactive caching. arXiv:2410.22134
[39] Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2024. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. In ACM SOSP. 590–606
[40] Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, and Max Ryabinin. 2024. Specexec: Massively parallel speculative decoding for interactive llm inference on consumer devices. NIPS 37 (2024), 16342–16368
[41] Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, and Minyi Guo. 2024. Hobbit: A mixed precision expert offloading system for fast moe inference. arXiv:2411.01433
[42] Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, and Jianzong Wang. 2025. MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts. arXiv:2506.07533
[43] Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388
[44] Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. 2023. Tabi: An efficient multi-level inference system for large language models. In EuroSys. 233–248
[45] Yuanxin Wei, Jiangsu Du, Jiazhi Jiang, Xiao Shi, Xianwei Zhang, Dan Huang, Nong Xiao, and Yutong Lu. 2024. APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes. In IEEE SC. 1–14
[46] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv:1910.03771
[47] Daliang Xu, Wangsong Yin, Hao Zhang, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu. 2025. EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding. IEEE TMC 24, 4 (2025), 3256–3273
[48] Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, and Luo Mai. 2025. MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching. arXiv:2503.09716
[49] Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You. 2024. OpenMoE: an early effort on open mixture-of-experts language models. In ICML. 55625–55655
[50] Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. 2025. MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache. arXiv:2401.14361
[51] Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, and Dhabaleswar K DK Panda. 2024. Exploiting inter-layer expert affinity for accelerating mixture-of-experts model inference. In IEEE IPDPS. 915–925
[52] Hanfei Yu, Xingqi Cui, Hong Zhang, and Hao Wang. 2025. fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving. arXiv:2502.05370
[53] Libo Zhang, Zhaoning Zhang, Baizhou Xu, Songzhu Mei, and Dongsheng Li. 2025. Dovetail: A cpu/gpu heterogeneous speculative decoding for llm inference. In EMNLP. 1–13
[54] Yujie Zhang, Shivam Aggarwal, and Tulika Mitra. 2025. DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference. In IEEE DATE. 1–7
[55] Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, and Yang You. 2024. Hetegen: Efficient heterogeneous parallel inference for large language models on resource-constrained devices. MLSys 6 (2024), 162–172
[56] Shuzhang Zhong, Ling Liang, Yuan Wang, Runsheng Wang, Ru Huang, and Meng Li. 2024. AdapMoE: Adaptive sensitivity-based expert gating and management for efficient moe inference. In IEEE ICCAD. 1–9
[57] Shuzhang Zhong, Yanfan Sun, Ling Liang, Runsheng Wang, Ru Huang, and Meng Li. 2025. HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference. In DAC. 1–7


work page arXiv 2023

[14] [14]

Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Zhaopeng Tu, and Tao Lin. 2024. Dynamic mixture of experts: An auto-tuning approach for efficient transformer models. arXiv:2405.14297

work page arXiv 2024

[15] [15]

Vima Gupta, Kartik Sinha, Ada Gavrilovska, and Anand Padmanabha Iyer. 2024. Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection. arXiv:2411.08982

work page internal anchor Pith review arXiv 2024

[16] [16]

Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shao- huai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, and Ong Yew Soon. 2024. Expertflow: Optimized expert activation and token alloca- tion for efficient mixture-of-experts inference. arXiv:2410.17954

work page arXiv 2024

[17] [17]

Huanqi Hu, Bowen Xiao, Shixuan Sun, Jianian Yin, Zhexi Zhang, Xi- ang Luo, Chengquan Jiang, Weiqi Xu, Xiaoying Jia, Xin Liu, et al

work page

[18] [18]

arXiv:2509.01229

LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving. arXiv:2509.01229

work page arXiv

[19] [19]

Haiyang Huang, Newsha Ardalani, Anna Sun, Liu Ke, Shruti Bhosale, Hsien-Hsin Lee, Carole-Jean Wu, and Benjamin Lee. 2024. Toward efficient inference for mixture of experts.NIPS37 (2024), 84033–84059

work page 2024

[20] [20]

Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, and Mao Yang. 2024. Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference. InIEEE ISCA. 1018–1031

work page 2024

[21] [21]

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al . 2024. Mixtral of experts. arXiv:2401.04088

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

Zewen Jin, Shengnan Wang, Jiaan Zhu, Hongrui Zhan, Youhui Bai, Lin Zhang, Zhenyu Ming, and Cheng Li. 2025. BigMac: A Communication- Efficient Mixture-of-Experts Model Structure for Fast Training and Inference. InAAAI. 17689–17698

work page 2025

[23] [23]

Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, and Baris Kasikci

work page

[24] [24]

Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture- of-Experts Models. InICLR. 56099–56115

work page

[25] [25]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

work page

[26] [26]

InACM SOSP

Efficient memory management for large language model serving with pagedattention. InACM SOSP. 611–626

work page

[27] [27]

Xinlu Lai. 2024. The DPO Dataset for Chinese and English with emoji. https://huggingface.co/datasets/shareAI/DPO-zh-en-emoji

work page 2024

[28] [28]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient generative inference of large language models with dynamic{KV}cache management. InOSDI. 155–172

work page 2024

[29] [29]

Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. 2023. Accelerating distributed MoE training and inference with lina. In USENIX ATC 23. 945–959

work page 2023

[30] [30]

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv:2405.04434

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning.NIPS36 (2023), 34892–34916

work page 2023

[32] [32]

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, et al. 2025. Muon is Scalable for LLM Training. arXiv:2502.16982

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, and Hongsheng Li. 2024. Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models. arXiv:2402.14800

work page arXiv 2024

[34] [34]

Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, et al . 2025. LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference. InISCA. 514–528

work page 2025

[35] [35]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, et al . 2024. GPT-4 Technical Report. arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, and Jie Zhang. 2025. InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference. InIEEE HPCA. 1510–1525

work page 2025

[37] [37]

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yux- iong He. 2022. Deepspeed-moe: Advancing mixture-of-experts infer- ence and training to power next-generation ai scale. InICML. 18332– 18346

work page 2022

[38] [38]

Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. InACM SC. 1–14

work page 2021

[39] [39]

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ra- mani, and Tri Dao. 2024. Flashattention-3: Fast and accurate attention Conference’26, xx 2026, xx Enda Yu, Zhaoning Zhang *, Dezun DONG*, Yongwei Wu, Xiangke Liao with asynchrony and low-precision.NIPS37 (2024), 68658–68685

work page 2024

[40] [40]

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang

work page

[41] [41]

Flexgen: High-throughput generative inference of large language models with a single gpu. InICML. 31094–31116

work page

[42] [42]

Xiaoniu Song, Zihang Zhong, Rong Chen, and Haibo Chen. 2024. Promoe: Fast moe-based llm serving using proactive caching. arXiv:2410.22134

work page arXiv 2024

[43] [43]

Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2024. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. In ACM SOSP. 590–606

work page 2024

[44] [44]

Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, and Max Ryabinin. 2024. Specexec: Massively parallel speculative decoding for interactive llm inference on consumer devices.NIPS37 (2024), 16342–16368

work page 2024

[45] [45]

Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng- Ann Heng, Chao Li, and Minyi Guo. 2024. Hobbit: A mixed precision expert offloading system for fast moe inference. arXiv:2411.01433

work page arXiv 2024

[46] [46]

Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, and Jianzong Wang. 2025. MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts. arXiv:2506.07533

work page arXiv 2025

[47] [47]

Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [48]

Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. 2023. Tabi: An efficient multi-level inference system for large language models. In EuroSys. 233–248

work page 2023

[49] [49]

Yuanxin Wei, Jiangsu Du, Jiazhi Jiang, Xiao Shi, Xianwei Zhang, Dan Huang, Nong Xiao, and Yutong Lu. 2024. APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes. InIEEE SC. 1–14

work page 2024

[50] [50]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al . 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv:1910.03771

work page internal anchor Pith review Pith/arXiv arXiv 2019

[51] [51]

Daliang Xu, Wangsong Yin, Hao Zhang, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu. 2025. EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding.IEEE TMC24, 4 (2025), 3256–3273

work page 2025

[52] [52]

Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, and Luo Mai. 2025. MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching. arXiv:2503.09716

work page arXiv 2025

[53] [53]

Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchun- shu Zhou, and Yang You. 2024. OpenMoE: an early effort on open mixture-of-experts language models. InICML. 55625–55655

work page 2024

[54] [54]

Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. 2025. MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache. arXiv:2401.14361

work page arXiv 2025

[55] [55]

Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, and Dhabaleswar K DK Panda. 2024. Exploiting inter-layer expert affinity for accelerating mixture-of-experts model inference. InIEEE IPDPS. 915–925

work page 2024

[56] [56]

Hanfei Yu, Xingqi Cui, Hong Zhang, and Hao Wang. 2025. fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving. arXiv:2502.05370

work page arXiv 2025

[57] [57]

Libo Zhang, Zhaoning Zhang, Baizhou Xu, Songzhu Mei, and Dong- sheng Li. 2025. Dovetail: A cpu/gpu heterogeneous speculative decod- ing for llm inference. InEMNLP. 1–13

work page 2025

[58] [58]

Yujie Zhang, Shivam Aggarwal, and Tulika Mitra. 2025. DAOP: Data- Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference. InIEEE DATE. 1–7

work page 2025

[59] [59]

Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, and Yang You. 2024. Hetegen: Efficient heterogeneous parallel inference for large language models on resource-constrained devices.MLSys6 (2024), 162–172

work page 2024

[60] [60]

Shuzhang Zhong, Ling Liang, Yuan Wang, Runsheng Wang, Ru Huang, and Meng Li. 2024. AdapMoE: Adaptive sensitivity-based expert gating and management for efficient moe inference. InIEEE ICCAD. 1–9

work page 2024

[61] [61]

Shuzhang Zhong, Yanfan Sun, Ling Liang, Runsheng Wang, Ru Huang, and Meng Li. 2025. HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference. InDAC. 1–7

work page 2025