
arxiv: 2509.07379 · v2 · submitted 2025-09-09 · 💻 cs.DC

DuoServe-MoE: Dual-Phase Expert Prefetch and Caching for LLM Inference QoS Assurance

Pith reviewed 2026-05-18 18:28 UTC · model grok-4.3

classification 💻 cs.DC
keywords mixture of experts · LLM serving · expert prefetch · prefill decode phases · GPU memory management · inference latency · QoS assurance · caching policies

The pith

DuoServe-MoE decouples the prefill and decode phases to cut MoE LLM end-to-end latency by up to 7.55× while keeping GPU memory low.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts LLMs activate many experts when processing an input prompt but only a few when generating each new token. A single expert loading policy therefore either exhausts memory in the first phase or leaves the second phase waiting on slow transfers. DuoServe-MoE splits the two phases and gives each its own specialized schedule. Prefill uses a two-stream CUDA pipeline that prefetches experts while other work runs; decode uses a small predictor trained on past activations to load only the experts that are likely next. The result is faster first-token and total response times without raising peak memory under tight GPU budgets.

Core claim

DuoServe-MoE is a serving system that decouples the prefill and decode phases of MoE inference and applies phase-specialized expert scheduling. In prefill it runs a two-stream CUDA pipeline that overlaps expert prefetching with non-MoE computation, shortening expert residency time and lowering peak memory. In decode it runs a lightweight layer-level predictor, trained offline on activation traces, that prefetches only the experts the model is likely to need. This dual-phase design avoids the memory blowup (in prefill) or tail-latency inflation (in decode) that a uniform policy produces.
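The benefit of overlapping expert transfers with non-MoE computation can be illustrated with a small timing model. This is a simplified sketch, not the paper's implementation: it assumes each layer has a fixed attention time, expert transfer time, and expert compute time, and that the copy stream starts fetching a layer's experts when that layer's attention begins. The function name `prefill_latency` and all timings are hypothetical.

```python
def prefill_latency(layers, overlap=True):
    """Total prefill time for per-layer (attn, transfer, expert) durations.

    Assumption: with overlap, a second stream copies the layer's experts
    while the compute stream runs attention; expert compute starts once
    both have finished. Without overlap, everything runs back to back.
    """
    compute_free = 0.0  # time at which the compute stream is next idle
    for attn, transfer, expert in layers:
        attn_start = compute_free
        if overlap:
            # Transfer and attention run concurrently on separate streams.
            ready = attn_start + max(attn, transfer)
        else:
            # Single stream: the transfer begins only after attention ends.
            ready = attn_start + attn + transfer
        compute_free = ready + expert
    return compute_free


# Hypothetical per-layer timings in milliseconds (not measured values).
layers = [(2.0, 3.0, 1.0)] * 4
serial = prefill_latency(layers, overlap=False)
piped = prefill_latency(layers, overlap=True)
print(serial, piped)
```

In this toy setting each layer costs `max(attn, transfer) + expert` instead of `attn + transfer + expert`, so the gain depends on how much of the transfer can hide behind non-MoE work. The model does not capture memory residency, which the paper also targets.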

What carries the argument

Phase-specialized expert prefetch and caching that uses a two-stream CUDA pipeline in prefill and an offline-trained layer-level predictor in decode to match the dense-sparse activation difference.
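To make the decode-side idea concrete, the sketch below uses a plain frequency table as a stand-in for the paper's learned predictor. It is an assumption-laden illustration: the source does not describe the predictor's architecture here, and the names `train_predictor` and `predict_experts`, the trace format, and the choice to condition on the previous layer's experts are all hypothetical.

```python
from collections import Counter, defaultdict


def train_predictor(traces):
    """Build a lookup table from offline activation traces.

    Assumed trace format: each trace is a list indexed by layer, and
    trace[l] is a tuple of the expert ids activated at layer l.
    """
    table = defaultdict(Counter)
    for route in traces:
        for layer in range(1, len(route)):
            key = (layer, frozenset(route[layer - 1]))
            table[key].update(route[layer])
    return table


def predict_experts(table, layer, prev_experts, k=2):
    """Return up to k experts most often seen after prev_experts."""
    counts = table.get((layer, frozenset(prev_experts)))
    if not counts:
        return []  # unseen context: the caller must fall back to on-demand loading
    # Sort by descending count, then by expert id for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [expert for expert, _ in ranked[:k]]


# Hypothetical two-layer traces, not data from the paper.
traces = [
    [(0, 1), (2, 3)],
    [(0, 1), (2, 3)],
    [(0, 1), (2, 5)],
]
table = train_predictor(traces)
prefetch = predict_experts(table, layer=1, prev_experts=(1, 0))
print(prefetch)
```

A table like this trades accuracy for simplicity; a trained model could generalize to contexts that never appeared in the traces, which the lookup cannot.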

If this is right

  • TTFT improves by up to 5.34× over baselines that use one policy for both phases.
  • End-to-end latency improves by up to 7.55× while peak GPU memory stays low.
  • Resource-constrained deployments can still meet strict latency SLOs for LLM-as-a-Service workloads.
  • Expert weights no longer force a direct trade-off between memory footprint and tail latency.
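For readers who prefer percentages, the headline factors can be converted into the fraction of baseline latency that remains. The snippet assumes the factors are ratios of baseline latency to DuoServe-MoE latency, which is the usual reading but is not spelled out here; `latency_fraction` is a hypothetical helper.

```python
def latency_fraction(speedup):
    """Fraction of baseline latency left after a given speedup factor."""
    return 1.0 / speedup


ttft_left = latency_fraction(5.34)  # best-case TTFT factor from the abstract
e2e_left = latency_fraction(7.55)   # best-case end-to-end factor from the abstract
print(round(ttft_left, 3), round(e2e_left, 3))
```

Under that reading, the best cases correspond to roughly 19% and 13% of baseline latency. Both are "up to" figures, so typical gains may be smaller.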

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same split between dense and sparse phases could be applied to other sparse-activation architectures beyond current MoE designs.
  • An online version of the predictor might further reduce the need for offline trace collection.
  • The approach may let operators serve larger MoE models on the same number of GPUs by keeping memory headroom for additional requests.

Load-bearing premise

The difference in how many experts are active during prefill versus decode stays large and predictable enough that two separate policies are worth the added complexity.
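How large the gap is depends on how many tokens pass through a layer at once. A back-of-the-envelope estimate, assuming each token independently picks its top-k experts uniformly at random (real routing is skewed, so this tends to overestimate), gives an expected count of E · (1 − (1 − k/E)^n) unique experts for E experts and n tokens. The helper name and the 8-expert, top-2 configuration (as in Mixtral) are chosen for illustration.

```python
def expected_unique_experts(num_experts, top_k, num_tokens):
    """Expected distinct experts hit per layer under uniform, independent routing."""
    # Probability that one specific expert is missed by every token.
    miss = (1.0 - top_k / num_experts) ** num_tokens
    return num_experts * (1.0 - miss)


single_decode = expected_unique_experts(8, 2, 1)   # one sequence, one token
batched_decode = expected_unique_experts(8, 2, 8)  # eight sequences decoding together
prefill = expected_unique_experts(8, 2, 512)       # a 512-token prompt
print(single_decode, round(batched_decode, 2), round(prefill, 2))
```

Under these assumptions a single decode token touches 2 of 8 experts, a batch of eight touches about 7.2, and a 512-token prefill touches essentially all 8. The premise therefore holds most clearly for small decode batches or strongly skewed routing.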

What would settle it

A controlled run of the same MoE models under identical hardware and workloads in which a single uniform expert-loading policy matches or beats the reported TTFT and end-to-end numbers without exceeding the same memory cap.
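The decision rule for such a run can be written down as a short check. This is a hypothetical sketch: the `Run` record, its field names, and the example numbers are placeholders rather than measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class Run:
    ttft_ms: float      # time to first token
    e2e_ms: float       # end-to-end latency
    peak_mem_gb: float  # peak GPU memory during the run


def uniform_policy_settles_it(uniform, duo, mem_cap_gb):
    """True if the uniform policy matches or beats DuoServe-MoE within the cap."""
    within_cap = uniform.peak_mem_gb <= mem_cap_gb
    matches = uniform.ttft_ms <= duo.ttft_ms and uniform.e2e_ms <= duo.e2e_ms
    return within_cap and matches


# Placeholder values only, to show how the check is used.
duo = Run(ttft_ms=120.0, e2e_ms=1000.0, peak_mem_gb=12.0)
fast_but_over_cap = Run(ttft_ms=100.0, e2e_ms=900.0, peak_mem_gb=16.0)
print(uniform_policy_settles_it(fast_but_over_cap, duo, mem_cap_gb=12.0))
```

A uniform policy that is faster but exceeds the memory cap does not count, which is why the check requires both conditions.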

Figures

Figures reproduced from arXiv: 2509.07379 by Dong Yuan, Grant Pinkert, Nan Yang, Yanli Li, Yuning Zhang.

Figure 1: Mixture of Experts Architecture in Mixtral.
Figure 2: Popularity and Affinity in MoE-based LLMs. Darker colors represent higher …
Figure 3: System Overview. The Expert Dispatcher handles expert weight prefetching …
Figure 4: Prefill and Decode scheduling mechanism in DuoServe-MoE. The top part of …
Figure 5: The performance evaluation across different test sets. The red bar represents …
Figure 6: The Cumulative Distribution Function (CDF) of decoding throughput.
Original abstract

Large Language Models (LLMs) are increasingly deployed as Internet/Web services (LLM-as-a-Service) with strict latency Service-Level Objectives (SLOs) under tight GPU memory budgets. Mixture-of-Experts (MoE) models improve quality and throughput via sparse expert activation, but serving them efficiently is challenging because expert weights dominate memory footprint and incur costly host--device transfers when offloaded. Moreover, MoE serving exhibits a phase disparity: the prefill phase tends to activate experts densely across many tokens, while the decode phase activates only a few experts per step. A uniform expert loading/caching policy across phases leads to either peak-memory blowup (prefill) or tail-latency inflation (decode). We present DuoServe-MoE, a QoS-oriented MoE serving system that decouples prefill and decode and applies phase-specialized expert scheduling. For prefill, DuoServe-MoE uses a two-stream CUDA pipeline to overlap expert prefetching with non-MoE computation, reducing expert residency time and peak GPU memory. For decode, it employs a lightweight layer-level predictor trained offline from activation traces to prefetch only likely experts without model changes. Experiments on representative MoE LLMs show that DuoServe-MoE improves TTFT by up to $5.34\times$ and end-to-end latency by up to $7.55\times$ over representative baselines, while maintaining low runtime GPU memory usage under resource-constrained deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript describes DuoServe-MoE, a QoS-oriented serving system for Mixture-of-Experts LLMs that decouples the prefill and decode phases to apply specialized expert prefetch and caching policies. For the prefill phase, it employs a two-stream CUDA pipeline to overlap expert prefetching with non-MoE computations. For the decode phase, it uses a lightweight layer-level predictor trained offline on activation traces. The paper reports experimental results showing up to 5.34× improvement in TTFT and 7.55× in end-to-end latency over baselines while maintaining low GPU memory usage.

Significance. This work tackles a practical challenge in deploying MoE models for latency-sensitive LLM-as-a-Service under memory constraints. The phase-specialized approach is a direct response to the observed disparity in expert activation patterns. If the results are reproducible and generalize, the speedups could meaningfully improve serving efficiency. The empirical evaluation against representative baselines is a positive aspect of the contribution.

major comments (1)
  1. [Evaluation] The central claim relies on the phase disparity between dense prefill and sparse decode activation being consistent enough to warrant separate policies. However, under continuous batching (common in production serving), a single decode step processes multiple sequences, which could increase the number of unique experts activated per layer and reduce sparsity. The manuscript should clarify whether experiments use continuous batching and provide data on expert activation sparsity or hit rates for the predictor in such settings. If not addressed, this weakens the justification for the dual-phase design and the reported gains.
minor comments (2)
  1. [Abstract] The abstract refers to 'representative baselines' without specifying them; naming the baselines (e.g., vLLM, TensorRT-LLM or specific MoE serving systems) would improve clarity.
  2. [§3] The description of the lightweight predictor could benefit from a small example or pseudocode to illustrate how it is trained and used at runtime.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work's significance and for the constructive feedback on the evaluation. We address the major comment below and outline the revisions we will make.

  1. Referee: [Evaluation] The central claim relies on the phase disparity between dense prefill and sparse decode activation being consistent enough to warrant separate policies. However, under continuous batching (common in production serving), a single decode step processes multiple sequences, which could increase the number of unique experts activated per layer and reduce sparsity. The manuscript should clarify whether experiments use continuous batching and provide data on expert activation sparsity or hit rates for the predictor in such settings. If not addressed, this weakens the justification for the dual-phase design and the reported gains.

    Authors: We appreciate the referee highlighting the relevance of continuous batching to production MoE serving. Our experiments were conducted with continuous batching enabled during the decode phase, consistent with standard serving frameworks, to evaluate the system under realistic multi-sequence workloads. The core phase disparity still holds because prefill activates experts densely across a large number of tokens in a single forward pass, while batched decode activates experts on a per-token, per-sequence basis (typically top-2 experts per layer per token). This keeps the number of unique experts per layer far lower than in prefill even as batch size grows. To strengthen the manuscript, we will revise the evaluation section to explicitly document the continuous-batching configuration and add new analysis of expert sparsity and predictor hit rates across a range of batch sizes, drawn from our existing activation traces. These additions will confirm that the dual-phase policies retain their advantages under batched decode.

    Revision: yes
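A predictor hit rate of the kind mentioned in this response could be computed as below. This is one possible definition (the share of experts actually needed that were already prefetched), offered as an assumption; the paper's exact metric is not given here, and `hit_rate` and the sample data are hypothetical.

```python
def hit_rate(predicted, actual):
    """Fraction of needed experts that were prefetched, over all decode steps.

    predicted[i] and actual[i] are the expert ids prefetched and the expert
    ids actually activated at step i (for one layer).
    """
    needed = sum(len(set(step)) for step in actual)
    if needed == 0:
        return 0.0
    hits = sum(len(set(p) & set(a)) for p, a in zip(predicted, actual))
    return hits / needed


# Hypothetical two-step example, not data from the paper.
predicted = [[2, 3], [0, 1]]
actual = [[2, 5], [0, 1]]
rate = hit_rate(predicted, actual)
print(rate)
```

Reporting this value per batch size would show directly whether prefetch accuracy degrades as more sequences share a decode step.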

Circularity Check

0 steps flagged

No circularity: empirical systems design with independent experimental validation


The paper describes a practical serving system that decouples prefill and decode phases based on observed activation patterns in MoE models, implements specialized prefetching mechanisms (CUDA pipeline and offline-trained predictor), and reports measured speedups against external baselines. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs or self-citations. The phase disparity is stated as an empirical property of MoE inference rather than something derived from the proposed system itself. The work remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption of consistent prefill/decode phase disparity and on the empirical effectiveness of the introduced scheduling policies; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: MoE serving exhibits a phase disparity in which prefill activates experts densely and decode activates only a few per step.
    Invoked when stating that a uniform policy leads to either peak-memory blowup or tail-latency inflation.




Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

    cs.LG 2026-05 unverdicted novelty 6.0

    VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.

  2. Temporally Extended Mixture-of-Experts Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.
