Alloc-MoE allocates a fixed expert activation budget at two levels: at the layer level, via dynamic programming guided by layer sensitivity, and at the token level, via score-based redistribution. On DeepSeek-V2-Lite at half the original budget, it delivers 1.15x prefill and 1.34x decode speedups while preserving performance.
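The summary names the two stages but not their details, so the following is only a minimal sketch under our own assumptions, not Alloc-MoE's implementation. The layer-level step is modeled as a multiple-choice knapsack solved by dynamic programming over a hypothetical sensitivity table `loss[l][k]`, the estimated degradation when layer `l` activates `k` experts per token. For the token level, it swaps in a simple rule of our own: every token keeps its top-1 expert, and the remaining slots of the layer's budget go to the highest-scoring (token, expert) pairs. The names `allocate_layer_budgets` and `redistribute_tokens` and all numbers are illustrative.

```python
from typing import List, Sequence


def allocate_layer_budgets(loss: Sequence[Sequence[float]], total_budget: int) -> List[int]:
    """Pick k_l experts per token for each layer l so that sum(k_l) == total_budget
    and the summed loss[l][k_l] is minimal (multiple-choice knapsack DP).

    ASSUMPTION: loss[l][k] is a hypothetical sensitivity estimate, i.e. the
    degradation when layer l activates k experts per token (k = 0 .. len(loss[l]) - 1).
    """
    inf = float("inf")
    # best[b]: minimal summed loss over the layers seen so far using exactly b units.
    best = [0.0] + [inf] * total_budget
    picks: List[List[int]] = []  # picks[l][b]: k chosen for layer l at cumulative budget b
    for layer_loss in loss:
        new_best = [inf] * (total_budget + 1)
        pick = [-1] * (total_budget + 1)
        for b in range(total_budget + 1):
            for k, cost in enumerate(layer_loss):
                # Skip k that overshoots b or extends an unreachable prefix state.
                if k > b or best[b - k] == inf:
                    continue
                if best[b - k] + cost < new_best[b]:
                    new_best[b] = best[b - k] + cost
                    pick[b] = k
        best = new_best
        picks.append(pick)
    if best[total_budget] == inf:
        raise ValueError("total_budget cannot be met exactly with the given per-layer options")
    # Backtrack from the last layer to recover each layer's k.
    ks = [0] * len(loss)
    b = total_budget
    for l in range(len(loss) - 1, -1, -1):
        ks[l] = picks[l][b]
        b -= ks[l]
    return ks


def redistribute_tokens(scores: Sequence[Sequence[float]], per_token_budget: int,
                        min_per_token: int = 1) -> List[List[int]]:
    """ASSUMED token-level rule: keep the layer's total of n_tokens * per_token_budget
    activations, guarantee each token its top-min_per_token experts, and hand the
    remaining slots to the highest-scoring (token, expert) pairs across all tokens."""
    n_tokens = len(scores)
    total = n_tokens * per_token_budget
    # Rank each token's experts by descending score (stable for ties).
    ranked = [sorted(range(len(row)), key=lambda e: -row[e]) for row in scores]
    selected = [ranked[t][:min_per_token] for t in range(n_tokens)]
    remaining = max(total - n_tokens * min_per_token, 0)
    # Pool every not-yet-selected pair and take the best `remaining` of them.
    pool = [(scores[t][e], t, e) for t in range(n_tokens) for e in ranked[t][min_per_token:]]
    pool.sort(key=lambda x: -x[0])
    for _, t, e in pool[:remaining]:
        selected[t].append(e)
    return selected


if __name__ == "__main__":
    # Toy sensitivity table: 3 layers, k in 0..3 (illustrative numbers, not measured).
    loss = [[10, 3, 1, 0.5], [10, 1, 0.8, 0.7], [10, 6, 2, 1.4]]
    print(allocate_layer_budgets(loss, total_budget=6))  # [2, 1, 3]
    # Toy router scores for 3 tokens over 3 experts.
    scores = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25], [0.5, 0.3, 0.2]]
    print(redistribute_tokens(scores, per_token_budget=2))  # [[0], [0, 1, 2], [0, 1]]
```

In the toy run, the confident first token keeps a single expert while the uncertain second token receives three, and the layer's total of six activations is unchanged. The exact-budget DP costs O(L · B · K) time for L layers, budget B, and K options per layer, which is cheap because it would run once offline rather than per token.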
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6159–6172, Bangkok, Thailand.
Cited by 1 Pith paper (field: cs.LG; year: 2026). Representative citing paper:
Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference