Alloc-MoE allocates a fixed expert activation budget using layer-level dynamic programming based on sensitivity and token-level score-based redistribution, delivering 1.15x prefill and 1.34x decode speedups on DeepSeek-V2-Lite at half the original budget while preserving performance.
Krishna Teja Chitty-Venkata, Sandeep Madireddy, Murali Emani, and Venkatram Vishwanath
2 Pith papers cite this work.
representative citing papers
Comprehensive profiling of expert selection in frontier MoE models reveals temporal and spatial patterns that enable 6.6x speedup on wafer-scale GPUs and 1.25x on existing systems via targeted optimizations.
citing papers explorer
- Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference
- Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference