DuoServe-MoE: Dual-Phase Expert Prefetch and Caching for LLM Inference QoS Assurance

Dong Yuan; Grant Pinkert; Nan Yang; Yanli Li; Yuning Zhang

arXiv: 2509.07379 · v2 · submitted 2025-09-09 · cs.DC

Yuning Zhang, Grant Pinkert, Nan Yang, Yanli Li, Dong Yuan

Pith reviewed 2026-05-18 18:28 UTC · model grok-4.3

keywords: mixture of experts · LLM serving · expert prefetch · prefill decode phases · GPU memory management · inference latency · QoS assurance · caching policies

The pith

DuoServe-MoE decouples prefill and decode phases to cut MoE LLM latency up to 7.55 times while keeping GPU memory low.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts LLMs activate many experts when processing an input prompt but only a few when generating each new token. A single expert loading policy therefore either exhausts memory in the first phase or leaves the second phase waiting on slow transfers. DuoServe-MoE splits the two phases and gives each its own specialized schedule. Prefill uses a two-stream CUDA pipeline that prefetches experts while other work runs; decode uses a small predictor trained on past activations to load only the experts that are likely next. The result is faster first-token and total response times without raising peak memory under tight GPU budgets.

Core claim

DuoServe-MoE is a serving system that decouples the prefill and decode phases of MoE inference and applies phase-specialized expert scheduling. In prefill it runs a two-stream CUDA pipeline that overlaps expert prefetching with non-MoE computation, shortening expert residency time and lowering peak GPU memory. In decode it runs a lightweight layer-level predictor, trained offline on activation traces, that prefetches only the experts the model is likely to need. This dual-phase design avoids the memory blowup in prefill and the tail-latency inflation in decode that a uniform policy produces.
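The benefit of overlapping expert prefetch with non-MoE computation can be illustrated with a small timing model. This is a sketch under stated assumptions, not the paper's CUDA implementation: it assumes one layer's experts are resident at a time, that the copy for a layer starts when the compute stream begins that layer's non-MoE work, and that the per-layer durations (`attn_ms`, `load_ms`, `moe_ms`) are made-up illustrative numbers.

```python
from typing import Sequence


def serial_latency(attn_ms: Sequence[float],
                   load_ms: Sequence[float],
                   moe_ms: Sequence[float]) -> float:
    """No overlap: non-MoE compute, then expert load, then expert compute."""
    return sum(attn_ms) + sum(load_ms) + sum(moe_ms)


def two_stream_latency(attn_ms: Sequence[float],
                       load_ms: Sequence[float],
                       moe_ms: Sequence[float]) -> float:
    """Overlap each layer's expert load (copy stream) with its non-MoE
    compute (compute stream). Expert compute waits for both to finish."""
    t = 0.0
    for attn, load, moe in zip(attn_ms, load_ms, moe_ms):
        attn_done = t + attn   # compute stream: attention and other non-MoE work
        load_done = t + load   # copy stream: host-to-device expert transfer
        t = max(attn_done, load_done) + moe
    return t


# Hypothetical per-layer timings in milliseconds for a 4-layer model.
attn = [3.0] * 4
load = [5.0] * 4
moe = [2.0] * 4

serial = serial_latency(attn, load, moe)         # 40.0
overlapped = two_stream_latency(attn, load, moe)  # 28.0
```

In this model the per-layer cost drops from `attn + load + moe` to `max(attn, load) + moe`, so the gain is largest when transfer time and non-MoE compute time are comparable.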

What carries the argument

Phase-specialized expert prefetch and caching that uses a two-stream CUDA pipeline in prefill and an offline-trained layer-level predictor in decode to match the dense-sparse activation difference.
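To make the idea of a trace-driven, layer-level predictor concrete, the sketch below uses a simple co-occurrence counter as a stand-in. The paper describes a lightweight predictor trained offline; the counting approach, the trace layout, and the names `fit_affinity`, `predict_next`, and `hit_rate` are assumptions made for illustration only and are not the authors' model.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

# Assumed trace layout: trace[layer] lists the expert ids activated at that
# layer during one decode step.
Trace = Sequence[Sequence[int]]
Affinity = Dict[Tuple[int, int], Counter]


def fit_affinity(traces: Sequence[Trace]) -> Affinity:
    """Count how often each expert at layer l+1 follows an expert at layer l."""
    counts: Affinity = defaultdict(Counter)
    for trace in traces:
        for layer in range(len(trace) - 1):
            for prev in trace[layer]:
                for nxt in trace[layer + 1]:
                    counts[(layer, prev)][nxt] += 1
    return counts


def predict_next(counts: Affinity, layer: int,
                 active: Sequence[int], k: int) -> List[int]:
    """Return the k experts at layer+1 with the highest summed counts."""
    scores: Counter = Counter()
    for prev in active:
        scores.update(counts.get((layer, prev), Counter()))
    return [expert for expert, _ in scores.most_common(k)]


def hit_rate(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of actually used experts that were prefetched."""
    needed = set(actual)
    return len(needed & set(predicted)) / len(needed)


# Made-up activation traces for a 3-layer model with top-2 routing.
traces = [
    [[0, 1], [2, 3], [4, 5]],
    [[0, 1], [2, 3], [4, 6]],
    [[0, 7], [2, 5], [4, 5]],
]
counts = fit_affinity(traces)
prefetch = predict_next(counts, layer=0, active=[0, 1], k=2)  # [2, 3]
```

A miss here means the missing expert must still be loaded on demand, so the hit rate directly bounds how much transfer latency the prefetch can hide.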

If this is right

TTFT drops by up to 5.34 times relative to baselines that use one policy for both phases.
End-to-end latency drops by up to 7.55 times while peak GPU memory stays low.
Resource-constrained deployments can still meet strict latency SLOs for LLM-as-a-Service workloads.
Expert weights no longer force a direct trade-off between memory footprint and tail latency.
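For readers unfamiliar with the two metrics, the sketch below shows how TTFT, end-to-end latency, and a speedup ratio are commonly computed from request and token timestamps. The timestamps are invented for illustration and are not measurements from the paper; the helper names are hypothetical.

```python
from typing import Sequence


def ttft_ms(request_ms: float, token_ms: Sequence[float]) -> float:
    """Time to first token: first token timestamp minus request arrival."""
    return token_ms[0] - request_ms


def e2e_ms(request_ms: float, token_ms: Sequence[float]) -> float:
    """End-to-end latency: last token timestamp minus request arrival."""
    return token_ms[-1] - request_ms


def speedup(baseline_ms: float, system_ms: float) -> float:
    """A value of 2.0 means the system takes half the baseline's time."""
    return baseline_ms / system_ms


# Invented timestamps (ms since request arrival at t = 0).
baseline_tokens = [800.0, 1900.0, 3000.0]
system_tokens = [200.0, 600.0, 1000.0]

ttft_gain = speedup(ttft_ms(0.0, baseline_tokens), ttft_ms(0.0, system_tokens))  # 4.0
e2e_gain = speedup(e2e_ms(0.0, baseline_tokens), e2e_ms(0.0, system_tokens))     # 3.0
```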

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same split between dense and sparse phases could be applied to other sparse-activation architectures beyond current MoE designs.
An online version of the predictor might further reduce the need for offline trace collection.
The approach may let operators serve larger MoE models on the same number of GPUs by keeping memory headroom for additional requests.

Load-bearing premise

The difference in how many experts are active during prefill versus decode stays large and predictable enough that two separate policies are worth the added complexity.
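A back-of-the-envelope calculation shows why the premise depends on how many tokens share a forward pass. The sketch assumes each token independently picks `top_k` distinct experts uniformly at random, which real routers do not do (routing is skewed toward popular experts), so it should be read as a rough upper-bound intuition rather than a measurement. The expert counts below are hypothetical configurations.

```python
def expected_unique_experts(num_experts: int, top_k: int, num_tokens: int) -> float:
    """Expected number of distinct experts hit in one layer.

    Under uniform, independent routing, one token misses a given expert
    with probability 1 - top_k / num_experts, so all tokens miss it with
    that probability raised to num_tokens.
    """
    p_miss = 1.0 - top_k / num_experts
    return num_experts * (1.0 - p_miss ** num_tokens)


# Hypothetical 8-expert, top-2 layer.
decode_single = expected_unique_experts(8, 2, 1)       # 2.0: one token per step
decode_batch8 = expected_unique_experts(8, 2, 8)       # about 7.2
prefill_512 = expected_unique_experts(8, 2, 512)       # essentially all 8

# Hypothetical 64-expert, top-2 layer keeps decode sparser at the same batch size.
decode_batch8_wide = expected_unique_experts(64, 2, 8)  # about 14.3
```

Under these assumptions, a layer with few experts loses most of its decode sparsity at modest batch sizes, while a layer with many experts keeps a wide gap between prefill and decode. That is the quantity the premise needs to stay large.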

What would settle it

A controlled run of the same MoE models under identical hardware and workloads in which a single uniform expert-loading policy matches or beats the reported TTFT and end-to-end numbers without exceeding the same memory cap.

Figures

Figures reproduced from arXiv: 2509.07379 by Dong Yuan, Grant Pinkert, Nan Yang, Yanli Li, Yuning Zhang.

**Figure 2.** Popularity and Affinity in MoE-based LLMs. Darker colors represent higher …

**Figure 3.** System Overview. The Expert Dispatcher handles expert weight prefetching …

**Figure 4.** Prefill and Decode scheduling mechanism in DuoServe-MoE. The top part of …

**Figure 5.** The performance evaluation across different test sets. The red bar represents …

**Figure 6.** The Cumulative Distribution Function (CDF) of decoding throughput.

Original abstract

Large Language Models (LLMs) are increasingly deployed as Internet/Web services (LLM-as-a-Service) with strict latency Service-Level Objectives (SLOs) under tight GPU memory budgets. Mixture-of-Experts (MoE) models improve quality and throughput via sparse expert activation, but serving them efficiently is challenging because expert weights dominate memory footprint and incur costly host-device transfers when offloaded. Moreover, MoE serving exhibits a phase disparity: the prefill phase tends to activate experts densely across many tokens, while the decode phase activates only a few experts per step. A uniform expert loading/caching policy across phases leads to either peak-memory blowup (prefill) or tail-latency inflation (decode). We present DuoServe-MoE, a QoS-oriented MoE serving system that decouples prefill and decode and applies phase-specialized expert scheduling. For prefill, DuoServe-MoE uses a two-stream CUDA pipeline to overlap expert prefetching with non-MoE computation, reducing expert residency time and peak GPU memory. For decode, it employs a lightweight layer-level predictor trained offline from activation traces to prefetch only likely experts without model changes. Experiments on representative MoE LLMs show that DuoServe-MoE improves TTFT by up to 5.34× and end-to-end latency by up to 7.55× over representative baselines, while maintaining low runtime GPU memory usage under resource-constrained deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DuoServe-MoE splits prefill and decode expert handling with a CUDA pipeline and offline predictor, which addresses a real serving pain point but may lose ground if continuous batching erodes decode sparsity.

Colleague, the main thing here is that the paper splits expert prefetch and caching into phase-specific policies instead of using one uniform approach for MoE models. Prefill gets a two-stream CUDA pipeline that overlaps prefetch with non-MoE work to cut peak memory and TTFT. Decode gets a lightweight layer-level predictor trained offline on activation traces to bring in only the likely experts. This targets the dense activation across tokens in prefill versus the sparse per-step activation in decode, and the abstract reports up to 5.34× better TTFT and 7.55× better end-to-end latency with low runtime GPU memory. That combination of techniques is the concrete new piece, and it is a practical response to the memory versus latency tradeoff that uniform policies create under tight GPU budgets.

The work is clearly aimed at production serving constraints rather than pure algorithmic novelty. The experiments are presented against representative baselines and claim measurable gains without model changes, which is the sort of engineering detail that matters for deployment.

A soft spot is the assumption that the phase disparity stays strong enough to justify separate policies. In continuous batching a single decode step can handle many sequences at once, which raises the number of unique experts activated per layer and could make the predictor miss more often or force a larger cache. The paper would be stronger with explicit results or discussion showing how the design behaves under realistic continuous batching rather than isolated phase tests. The math and data here are empirical rather than derived, so the claims rest on the experimental setup being representative.

Overall this is for systems researchers and engineers who build LLM inference stacks, especially those working with MoE models on constrained hardware. A reader looking for concrete scheduling ideas would get value from the CUDA pipeline and the trace-based predictor. It deserves a serious referee because the problem is current, the techniques are implementable, and the reported gains are large enough to warrant checking the details and the batching concern.

Referee Report

1 major / 2 minor

Summary. The manuscript describes DuoServe-MoE, a QoS-oriented serving system for Mixture-of-Experts LLMs that decouples the prefill and decode phases to apply specialized expert prefetch and caching policies. For the prefill phase, it employs a two-stream CUDA pipeline to overlap expert prefetching with non-MoE computations. For the decode phase, it uses a lightweight layer-level predictor trained offline on activation traces. The paper reports experimental results showing up to 5.34× improvement in TTFT and 7.55× in end-to-end latency over baselines while maintaining low GPU memory usage.

Significance. This work tackles a practical challenge in deploying MoE models for latency-sensitive LLM-as-a-Service under memory constraints. The phase-specialized approach is a direct response to the observed disparity in expert activation patterns. If the results are reproducible and generalize, the speedups could meaningfully improve serving efficiency. The empirical evaluation against representative baselines is a positive aspect of the contribution.

major comments (1)

[Evaluation] The central claim relies on the phase disparity between dense prefill and sparse decode activation being consistent enough to warrant separate policies. However, under continuous batching (common in production serving), a single decode step processes multiple sequences, which could increase the number of unique experts activated per layer and reduce sparsity. The manuscript should clarify whether experiments use continuous batching and provide data on expert activation sparsity or hit rates for the predictor in such settings. If not addressed, this weakens the justification for the dual-phase design and the reported gains.

minor comments (2)

[Abstract] The abstract refers to 'representative baselines' without specifying them; naming the baselines (e.g., vLLM, TensorRT-LLM or specific MoE serving systems) would improve clarity.
[§3] The description of the lightweight predictor could benefit from a small example or pseudocode to illustrate how it is trained and used at runtime.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work's significance and for the constructive feedback on the evaluation. We address the major comment below and outline the revisions we will make.

Referee: [Evaluation] The central claim relies on the phase disparity between dense prefill and sparse decode activation being consistent enough to warrant separate policies. However, under continuous batching (common in production serving), a single decode step processes multiple sequences, which could increase the number of unique experts activated per layer and reduce sparsity. The manuscript should clarify whether experiments use continuous batching and provide data on expert activation sparsity or hit rates for the predictor in such settings. If not addressed, this weakens the justification for the dual-phase design and the reported gains.

Authors: We appreciate the referee highlighting the relevance of continuous batching to production MoE serving. Our experiments were conducted with continuous batching enabled during the decode phase, consistent with standard serving frameworks, to evaluate the system under realistic multi-sequence workloads. The core phase disparity still holds because prefill activates experts densely across a large number of tokens in a single forward pass, while batched decode activates experts on a per-token, per-sequence basis (typically top-2 experts per layer per token). This keeps the number of unique experts per layer far lower than in prefill even as batch size grows. To strengthen the manuscript, we will revise the evaluation section to explicitly document the continuous-batching configuration and add new analysis of expert sparsity and predictor hit rates across a range of batch sizes, drawn from our existing activation traces. These additions will confirm that the dual-phase policies retain their advantages under batched decode.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems design with independent experimental validation

The paper describes a practical serving system that decouples prefill and decode phases based on observed activation patterns in MoE models, implements specialized prefetching mechanisms (CUDA pipeline and offline-trained predictor), and reports measured speedups against external baselines. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs or self-citations. The phase disparity is stated as an empirical property of MoE inference rather than something derived from the proposed system itself. The work remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption of consistent prefill/decode phase disparity and on the empirical effectiveness of the introduced scheduling policies; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: MoE serving exhibits a phase disparity where prefill activates experts densely and decode activates only a few per step.
Invoked when stating that a uniform policy leads to either peak-memory blowup or tail-latency inflation.

pith-pipeline@v0.9.0 · 5803 in / 1219 out tokens · 34535 ms · 2026-05-18T18:28:56.242068+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "MoE inference consists of two fundamentally different stages: a prefill stage where most experts are activated densely, and a decode stage where only a few experts are triggered sparsely."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "lightweight layer-level predictor trained offline from activation traces... popularity... inter-layer affinity"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading
cs.LG · 2026-05 · unverdicted · novelty 6.0

VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.

Temporally Extended Mixture-of-Experts Models
cs.LG · 2026-04 · unverdicted · novelty 6.0

Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.

[1] [1]

W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A survey of large language models, arXiv preprint arXiv:2303.18223 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Accessed: 2024-09-09

OpenAI, Chatgpt (gpt-4),https://www.openai.com, 2024. Accessed: 2024-09-09

work page 2024

[3] [3]

Shazeer, A

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, in: Interna- tional Conference on Learning Representations, 2017. URL: https://openreview.net/forum?id=B1ckMDqlg

work page 2017

[4] [4]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017)

work page 2017

[5] [5]

Fedus, B

W. Fedus, B. Zoph, N. Shazeer, Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, Journal of Machine Learning Research 23 (2022) 1–39. 24

work page 2022

[6] [6]

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bam- ford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al., Mixtral of experts, arXiv preprint arXiv:2401.04088 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, et al., Deepseekmoe: Towards ultimate expert spe- cialization in mixture-of-experts language models, arXiv preprint arXiv:2401.06066 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

G. Qu, Q. Chen, W. Wei, Z. Lin, X. Chen, K. Huang, Mobile edge intelligence for large language models: A contemporary survey, arXiv preprint arXiv:2407.18921 (2024)

work page arXiv 2024

[9] [9]

N. Dhar, B. Deng, D. Lo, X. Wu, L. Zhao, K. Suo, An empirical analysis and resource footprint study of deploying large language models on edge devices, in: Proceedings of the 2024 ACM Southeast Conference, 2024, pp. 69–76

work page 2024

[10] [10]

D. Yu, L. Shen, H. Hao, W. Gong, H. Wu, J. Bian, L. Dai, H. Xiong, Moesys: A distributed and efficient mixture-of-experts training and in- ference system for internet services, IEEE Transactions on Services Computing (2024)

work page 2024

[11] [11]

HuggingFace, Huggingface accelerate,https://huggingface.co/docs /accelerate/index, 2022

work page 2022

[12] [12]

Hwang, J

R. Hwang, J. Wei, S. Cao, C. Hwang, X. Tang, T. Cao, M. Yang, Pre- gatedmoe: Analgorithm-systemco-designforfastandscalablemixture- of-expert inference, in: 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), IEEE, 2024, pp. 1018– 1031

work page 2024

[13] [13]

R. Yi, L. Guo, S. Wei, A. Zhou, S. Wang, M. Xu, Edgemoe: Fast on- device inference of moe-based large language models, arXiv preprint arXiv:2308.14352 (2023)

work page arXiv 2023

[14] [14]

Z. Du, S. Li, Y. Wu, X. Jiang, J. Sun, Q. Zheng, Y. Wu, A. Li, H. Li, Y. Chen, Sida: Sparsity-inspired data-aware serving for efficient and scalable large mixture-of-expertsmodels, Proceedings of MachineLearn- ing and Systems 6 (2024) 224–238. 25

work page 2024

[15] [15]

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, I. Stoica, Efficient memory management for large language model serving with pagedattention, in: Proceedings of the 29th Sympo- sium on Operating Systems Principles, 2023, pp. 611–626

work page 2023

[16] [16]

R. Y. Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Rasley, et al., Deepspeed-inference: enablingefficientinferenceoftransformermodelsatunprecedentedscale, in: SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, 2022, pp. 1–15

work page 2022

[17] [17]

Lepikhin, H

D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, Z. Chen, Gshard: Scaling giant mod- els with conditional computation and automatic sharding, in: In- ternational Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=qrwe7XHTmYb

work page 2021

[18] [18]

L. Xue, Y. Fu, Z. Lu, L. Mai, M. Marina, Moe-infinity: Activation- aware expert offloading for efficient moe serving, arXiv preprint arXiv:2401.14361 (2024)

work page arXiv 2024

[19] [19]

Hochreiter, Long short-term memory, Neural Computation MIT- Press (1997)

S. Hochreiter, Long short-term memory, Neural Computation MIT- Press (1997)

work page 1997

[20] [20]

J. Yao, Q. Anthony, A. Shafi, H. Subramoni, D. K. D. Panda, Exploiting inter-layer expert affinity for accelerating mixture-of-experts model in- ference, in: 2024 IEEE International Parallel andDistributed Processing Symposium (IPDPS), IEEE, 2024, pp. 915–925

work page 2024

[21] [21]

U. Ruby, V. Yendapalli, Binary cross entropy with deep learning tech- nique for image classification, International Journal of Advanced Trends in Computer Science and Engineering 9 (2020)

work page 2020

[22] [22]

URL: http://on-demand.gputechconf.com/gtc-express/2011/presen tations/StreamsAndConcurrencyWebinar.pdf

NVIDIA Corporation, Nvidia cuda c/c++ streams and concurrency, 2021. URL: http://on-demand.gputechconf.com/gtc-express/2011/presen tations/StreamsAndConcurrencyWebinar.pdf

work page 2021

[23] [23]

URL: https://developer.nvidia.com/cuda-toolkit, version 12.0

NVIDIA Corporation, Nvidia cuda toolkit, 2020. URL: https://developer.nvidia.com/cuda-toolkit, version 12.0. 26

[24]

J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, S. Han, Awq: Activation-aware weight quantization for on-device llm compression and acceleration, Proceedings of Machine Learning and Systems 6 (2024) 87–100

[25]

P. Rajpurkar, R. Jia, P. Liang, Know what you don’t know: Unanswerable questions for squad, arXiv preprint arXiv:1806.03822 (2018)

[26]

A. Mitra, H. Khanpour, C. Rosset, A. Awadallah, Orca-math: Unlocking the potential of slms in grade school math, arXiv preprint arXiv:2402.14830 (2024)

[27]

E. Frantar, D. Alistarh, Qmoe: Sub-1-bit compression of trillion parameter models, Proceedings of Machine Learning and Systems 6 (2024) 439–451

[28]

T. Chen, S. Huang, Y. Xie, B. Jiao, D. Jiang, H. Zhou, J. Li, F. Wei, Task-specific expert pruning for sparse mixture-of-experts, arXiv preprint arXiv:2206.00277 (2022)

[29]

E. Frantar, D. Alistarh, Sparsegpt: Massive language models can be accurately pruned in one-shot, in: International Conference on Machine Learning, PMLR, 2023, pp. 10323–10337

[30]

M. Sun, Z. Liu, A. Bair, J. Z. Kolter, A simple and effective pruning approach for large language models, in: The Twelfth International Conference on Learning Representations, 2024. URL: https://openreview.net/forum?id=PxoFut3dWW

[31]

Y. Fu, H. Peng, L. Ou, A. Sabharwal, T. Khot, Specializing smaller language models towards multi-step reasoning, in: International Conference on Machine Learning, PMLR, 2023, pp. 10421–10430

[32]

M. Wu, A. Waheed, C. Zhang, M. Abdul-Mageed, A. F. Aji, LaMini-LM: A diverse herd of distilled models from large-scale instructions, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Juli...

[33]

E. Frantar, S. Ashkboos, T. Hoefler, D. Alistarh, OPTQ: Accurate quantization for generative pre-trained transformers, in: The Eleventh International Conference on Learning Representations, 2023. URL: https://openreview.net/forum?id=tcbBPnfwxS

[34]

Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. Ré, I. Stoica, C. Zhang, Flexgen: High-throughput generative inference of large language models with a single gpu, in: International Conference on Machine Learning, PMLR, 2023, pp. 31094–31116

[35]

Z. Xuan, B. Jia, H. Zhou, Z. Liu, S. Cheng, Y. You, Hetegen: Efficient heterogeneous parallel inference for large language models on resource-constrained devices, Proceedings of Machine Learning and Systems 6 (2024) 162–172

[36]

J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, Y. He, Zero-offload: Democratizing billion-scale model training, in: 2021 USENIX Annual Technical Conference (USENIX ATC 21), 2021, pp. 551–564

[37]

M. Adnan, A. Arunkumar, G. Jain, P. Nair, I. Soloveychik, P. Kamath, Keyformer: Kv cache reduction through key tokens selection for efficient generative inference, Proceedings of Machine Learning and Systems 6 (2024) 114–127

[38]

J. Li, Y. Jiang, Y. Zhu, C. Wang, H. Xu, Accelerating distributed moe training and inference with lina, in: 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 945–959

[39]

S. Shi, X. Pan, Q. Wang, C. Liu, X. Ren, Z. Hu, Y. Yang, B. Li, X. Chu, Schemoe: An extensible mixture-of-experts distributed training system with tasks scheduling, in: Proceedings of the Nineteenth European Conference on Computer Systems, 2024, pp. 236–249

[40]

R. Kong, Y. Li, Q. Feng, W. Wang, X. Ye, Y. Ouyang, L. Kong, Y. Liu, Swapmoe: Serving off-the-shelf moe-based large language models with tunable memory budget, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 6710–6720.
