DuoServe-MoE: Dual-Phase Expert Prefetch and Caching for LLM Inference QoS Assurance
Pith reviewed 2026-05-18 18:28 UTC · model grok-4.3
The pith
DuoServe-MoE decouples the prefill and decode phases of MoE LLM serving and reports end-to-end latency improvements of up to 7.55× while keeping GPU memory usage low.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DuoServe-MoE is a serving system that decouples the prefill and decode phases of MoE inference and applies phase-specialized expert scheduling. In prefill it runs a two-stream CUDA pipeline that overlaps expert prefetching with non-MoE computation, shortening expert residency time and peak memory. In decode it runs a lightweight layer-level predictor, trained offline on activation traces, that prefetches only the experts the model is likely to need. This dual-phase design prevents the memory blowup or tail-latency inflation that a uniform policy produces.
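The prefill overlap can be illustrated with a toy timing model. This is a minimal sketch in Python, not the paper's CUDA implementation: the `LayerCost` fields, both functions, and all millisecond values are hypothetical, chosen only to show why running the expert copy concurrently with non-MoE compute shortens the critical path.

```python
from dataclasses import dataclass


@dataclass
class LayerCost:
    transfer_ms: float  # assumed host-to-device copy time for this layer's experts
    non_moe_ms: float   # assumed attention and other non-MoE compute time
    expert_ms: float    # assumed expert FFN compute time once weights are resident


def serial_latency(layers):
    """Single stream: copy experts, run non-MoE compute, then run experts."""
    return sum(l.transfer_ms + l.non_moe_ms + l.expert_ms for l in layers)


def overlapped_latency(layers):
    """Two streams: the copy for a layer runs concurrently with that layer's
    non-MoE compute; expert compute starts when both have finished."""
    return sum(max(l.transfer_ms, l.non_moe_ms) + l.expert_ms for l in layers)


# Illustrative numbers only, not measurements from the paper.
layers = [LayerCost(transfer_ms=4.0, non_moe_ms=3.0, expert_ms=2.0)] * 32
print(serial_latency(layers))      # 288.0
print(overlapped_latency(layers))  # 192.0
```

In this model the gain per layer is `min(transfer_ms, non_moe_ms)`, so overlap helps most when copy time and non-MoE compute time are comparable. A real pipeline may also fetch the next layer's experts during the current layer's expert compute, which this sketch does not model.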
What carries the argument
Phase-specialized expert prefetch and caching that uses a two-stream CUDA pipeline in prefill and an offline-trained layer-level predictor in decode to match the dense-sparse activation difference.
If this is right
- TTFT improves by up to 5.34× relative to baselines that use one policy for both phases.
- End-to-end latency improves by up to 7.55× while peak GPU memory stays low.
- Resource-constrained deployments can still meet strict latency SLOs for LLM-as-a-Service workloads.
- Expert weights no longer force a direct trade-off between memory footprint and tail latency.
Where Pith is reading between the lines
- The same split between dense and sparse phases could be applied to other sparse-activation architectures beyond current MoE designs.
- An online version of the predictor might further reduce the need for offline trace collection.
- The approach may let operators serve larger MoE models on the same number of GPUs by keeping memory headroom for additional requests.
Load-bearing premise
The difference in how many experts are active during prefill versus decode stays large and predictable enough that two separate policies are worth the added complexity.
What would settle it
A controlled run of the same MoE models under identical hardware and workloads in which a single uniform expert-loading policy matches or beats the reported TTFT and end-to-end numbers without exceeding the same memory cap.
Original abstract
Large Language Models (LLMs) are increasingly deployed as Internet/Web services (LLM-as-a-Service) with strict latency Service-Level Objectives (SLOs) under tight GPU memory budgets. Mixture-of-Experts (MoE) models improve quality and throughput via sparse expert activation, but serving them efficiently is challenging because expert weights dominate memory footprint and incur costly host-device transfers when offloaded. Moreover, MoE serving exhibits a phase disparity: the prefill phase tends to activate experts densely across many tokens, while the decode phase activates only a few experts per step. A uniform expert loading/caching policy across phases leads to either peak-memory blowup (prefill) or tail-latency inflation (decode). We present DuoServe-MoE, a QoS-oriented MoE serving system that decouples prefill and decode and applies phase-specialized expert scheduling. For prefill, DuoServe-MoE uses a two-stream CUDA pipeline to overlap expert prefetching with non-MoE computation, reducing expert residency time and peak GPU memory. For decode, it employs a lightweight layer-level predictor trained offline from activation traces to prefetch only likely experts without model changes. Experiments on representative MoE LLMs show that DuoServe-MoE improves TTFT by up to 5.34× and end-to-end latency by up to 7.55× over representative baselines, while maintaining low runtime GPU memory usage under resource-constrained deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes DuoServe-MoE, a QoS-oriented serving system for Mixture-of-Experts LLMs that decouples the prefill and decode phases to apply specialized expert prefetch and caching policies. For the prefill phase, it employs a two-stream CUDA pipeline to overlap expert prefetching with non-MoE computations. For the decode phase, it uses a lightweight layer-level predictor trained offline on activation traces. The paper reports experimental results showing up to 5.34× improvement in TTFT and 7.55× in end-to-end latency over baselines while maintaining low GPU memory usage.
Significance. This work tackles a practical challenge in deploying MoE models for latency-sensitive LLM-as-a-Service under memory constraints. The phase-specialized approach is a direct response to the observed disparity in expert activation patterns. If the results are reproducible and generalize, the speedups could meaningfully improve serving efficiency. The empirical evaluation against representative baselines is a positive aspect of the contribution.
major comments (1)
- [Evaluation] The central claim relies on the phase disparity between dense prefill and sparse decode activation being consistent enough to warrant separate policies. However, under continuous batching (common in production serving), a single decode step processes multiple sequences, which could increase the number of unique experts activated per layer and reduce sparsity. The manuscript should clarify whether experiments use continuous batching and provide data on expert activation sparsity or hit rates for the predictor in such settings. If not addressed, this weakens the justification for the dual-phase design and the reported gains.
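The batching concern can be made concrete with a back-of-the-envelope estimate. The sketch below assumes each token independently picks `top_k` distinct experts uniformly at random out of `num_experts`; that routing model, the function name, and the 8-expert top-2 configuration are illustrative assumptions, not figures from the manuscript. Under that assumption the expected number of distinct experts touched by `tokens` tokens in one layer is `num_experts * (1 - (1 - top_k / num_experts) ** tokens)`.

```python
def expected_unique_experts(num_experts, top_k, tokens):
    """Expected count of distinct experts hit in one layer when each token
    independently selects top_k distinct experts uniformly at random."""
    # Probability that one given expert is missed by every token.
    miss = (1.0 - top_k / num_experts) ** tokens
    return num_experts * (1.0 - miss)


# Hypothetical 8-expert, top-2 layer at several decode batch sizes.
for batch in (1, 2, 4, 8, 16):
    print(batch, round(expected_unique_experts(8, 2, batch), 2))
```

With these assumed settings the estimate rises from 2.0 experts at one token to about 7.2 of 8 at eight tokens, which is why measured sparsity under batched decode matters. Real routers are skewed and correlated across tokens, so observed counts may be lower than this uniform estimate.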
minor comments (2)
- [Abstract] The abstract refers to 'representative baselines' without specifying them; naming the baselines (e.g., vLLM, TensorRT-LLM or specific MoE serving systems) would improve clarity.
- [§3] The description of the lightweight predictor could benefit from a small example or pseudocode to illustrate how it is trained and used at runtime.
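As an illustration of the kind of example the second minor comment asks for, here is a minimal sketch of a layer-level expert predictor built offline from activation traces. It swaps in a plain counting heuristic (per-layer popularity plus co-occurrence with the previous layer's experts); the paper's actual predictor is a trained model whose details are not given in this review. The trace format, function names, and `budget` parameter are all hypothetical.

```python
from collections import Counter, defaultdict


def train_predictor(traces, num_layers):
    """Build count tables offline.
    traces: list of routes; route[l] is the set of experts used at layer l."""
    popularity = [Counter() for _ in range(num_layers)]
    affinity = [defaultdict(Counter) for _ in range(num_layers)]
    for route in traces:
        for layer in range(num_layers):
            popularity[layer].update(route[layer])
            if layer > 0:
                # Count which experts at this layer follow each previous-layer expert.
                for prev in route[layer - 1]:
                    affinity[layer][prev].update(route[layer])
    return popularity, affinity


def predict(popularity, affinity, layer, prev_experts, budget):
    """Return up to `budget` expert ids to prefetch for `layer`."""
    scores = Counter()
    for prev in prev_experts:
        scores.update(affinity[layer].get(prev, Counter()))
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    # Pad with globally popular experts when affinity gives too few candidates.
    for expert, _ in popularity[layer].most_common():
        if len(ranked) >= budget:
            break
        if expert not in ranked:
            ranked.append(expert)
    return ranked[:budget]


# Tiny made-up trace set: 2 layers, 4 experts, top-2 routing.
traces = [[{0, 1}, {2, 3}], [{0, 1}, {2, 3}], [{0, 2}, {1, 3}]]
popularity, affinity = train_predictor(traces, num_layers=2)
print(predict(popularity, affinity, layer=1, prev_experts={0, 1}, budget=2))  # [3, 2]
```

At runtime the lookup is a handful of dictionary reads per layer, which is the sense in which such a predictor is lightweight. A learned model would replace the count tables but could keep the same `predict` interface.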
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work's significance and for the constructive feedback on the evaluation. We address the major comment below and outline the revisions we will make.
Referee: [Evaluation] The central claim relies on the phase disparity between dense prefill and sparse decode activation being consistent enough to warrant separate policies. However, under continuous batching (common in production serving), a single decode step processes multiple sequences, which could increase the number of unique experts activated per layer and reduce sparsity. The manuscript should clarify whether experiments use continuous batching and provide data on expert activation sparsity or hit rates for the predictor in such settings. If not addressed, this weakens the justification for the dual-phase design and the reported gains.
Authors: We appreciate the referee highlighting the relevance of continuous batching to production MoE serving. Our experiments were conducted with continuous batching enabled during the decode phase, consistent with standard serving frameworks, to evaluate the system under realistic multi-sequence workloads. The core phase disparity still holds because prefill activates experts densely across a large number of tokens in a single forward pass, while batched decode activates experts on a per-token, per-sequence basis (typically top-2 experts per layer per token). This keeps the number of unique experts per layer far lower than in prefill even as batch size grows. To strengthen the manuscript, we will revise the evaluation section to explicitly document the continuous-batching configuration and add new analysis of expert sparsity and predictor hit rates across a range of batch sizes, drawn from our existing activation traces. These additions will confirm that the dual-phase policies retain their advantages under batched decode. revision: yes
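The hit-rate analysis promised in the response could be computed per decode step along the following lines. This is a sketch under assumptions: the function name, the set-based trace format, and the two metrics (hit rate over needed experts, wasted fraction of prefetches) are hypothetical choices, not definitions taken from the manuscript.

```python
def prefetch_metrics(prefetched, activated_per_sequence):
    """Score one layer of one batched decode step.
    prefetched: set of expert ids loaded ahead of time.
    activated_per_sequence: list of sets, one per sequence in the batch."""
    # Under batching, the layer needs the union of every sequence's experts.
    needed = set().union(*activated_per_sequence)
    hit_rate = len(prefetched & needed) / len(needed) if needed else 1.0
    wasted = len(prefetched - needed) / len(prefetched) if prefetched else 0.0
    return hit_rate, wasted


# Made-up step: two sequences, three experts prefetched.
hit_rate, wasted = prefetch_metrics({0, 1, 2}, [{0, 1}, {1, 3}])
print(round(hit_rate, 3), round(wasted, 3))
```

Reporting both numbers across batch sizes would separate two failure modes: misses that stall decode on a host-to-device copy, and over-prefetching that erodes the memory savings.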
Circularity Check
No circularity: empirical systems design with independent experimental validation
The paper describes a practical serving system that decouples prefill and decode phases based on observed activation patterns in MoE models, implements specialized prefetching mechanisms (CUDA pipeline and offline-trained predictor), and reports measured speedups against external baselines. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs or self-citations. The phase disparity is stated as an empirical property of MoE inference rather than something derived from the proposed system itself. The work remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: MoE serving exhibits a phase disparity where prefill activates experts densely and decode activates only a few per step
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: MoE inference consists of two fundamentally different stages: a prefill stage where most experts are activated densely, and a decode stage where only a few experts are triggered sparsely.
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: lightweight layer-level predictor trained offline from activation traces... popularity... inter-layer affinity
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading
  VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.
- Temporally Extended Mixture-of-Experts Models
  Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.