Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity

Binfan Zheng; Can Chen; Dacheng Tao; Fangcheng Liu; Fei Mi; Hang Zhou; Hanting Chen; Hui Zang; Jinpeng Li; Kai Han

arxiv: 2505.21411 · v2 · pith:FBOWBOYJnew · submitted 2025-05-27 · 💻 cs.CL

Yehui Tang, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jinpeng Li, Hui Zang, Fei Mi, Xiaojun Meng, Zhicheng Liu, Hanting Chen, Binfan Zheng, Can Chen, Youliang Yan, Ruiming Tang, Peifeng Qin, Xinghao Chen, Dacheng Tao, Yunhe Wang (and Other Contributors)

classification 💻 cs.CL

keywords experts · model · ascend · pangu · inference · activated · devices · execution

The rise of Mixture of Experts (MoE) in Large Language Models promises a much larger parameter count and learning capacity at a small execution cost, because only a small fraction of parameters is activated for each input token. However, it is commonly observed that some experts are activated far more often than others, leading to system inefficiency when running the experts on different devices in parallel. Therefore, we introduce Mixture of Grouped Experts (MoGE), which groups the experts during selection and, by design, balances the expert workload better than MoE. It constrains tokens to activate an equal number of experts within each predefined expert group. When model execution is distributed across multiple devices, this architectural design ensures a balanced computational load across devices, significantly enhancing throughput, particularly for the inference phase. Further, we build Pangu Pro MoE, a sparse model based on MoGE with 72 billion total parameters, 16 billion of which are activated for each token, on Ascend NPUs. The configuration of Pangu Pro MoE is optimized for Ascend 300I Duo and 800I A2 through extensive system simulation studies. Our experiments indicate that MoGE indeed leads to better expert load balancing and more efficient execution for both model training and inference on Ascend NPUs. Pangu Pro MoE achieves an inference throughput of 1148 tokens/s per card, which speculative acceleration further improves to 1528 tokens/s per card, outperforming comparable 32B and 72B dense models. Furthermore, we achieve an excellent cost-to-performance ratio for model inference on Ascend 300I Duo. Our studies show that Ascend NPUs are capable of training Pangu Pro MoE with massive parallelization, making it a leading model within the sub-100B total parameter class that outperforms prominent open-source models like GLM-Z1-32B and Qwen3-32B.
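
The grouped selection described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the expert count, group count, score bias, and the assumption that each group lives on its own device are all hypothetical values chosen for illustration, and the function names (`grouped_top_k`, `global_top_k`, `simulate`) are invented here. The sketch compares per-group top-k selection with ordinary global top-k selection when the router is biased toward a few experts.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_top_k(scores, num_groups, k_per_group):
    """Pick the k_per_group highest-scoring experts inside each equal-sized group."""
    if len(scores) % num_groups != 0:
        raise ValueError("experts must divide evenly into groups")
    group_size = len(scores) // num_groups
    if k_per_group > group_size:
        raise ValueError("k_per_group cannot exceed the group size")
    chosen = []
    for g in range(num_groups):
        members = range(g * group_size, (g + 1) * group_size)
        best = sorted(members, key=lambda e: scores[e], reverse=True)[:k_per_group]
        chosen.extend(best)
    return chosen


def global_top_k(scores, k):
    """Conventional MoE selection: the k best experts regardless of group."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def device_load(assignments, num_groups, group_size):
    """Count activations per group (assumption: one group per device)."""
    load = [0] * num_groups
    for experts in assignments:
        for e in experts:
            load[e // group_size] += 1
    return load


def simulate(num_tokens=1000, num_experts=16, num_groups=4, k_per_group=1, seed=0):
    """Route random tokens with both strategies and return the two load vectors."""
    rng = random.Random(seed)
    group_size = num_experts // num_groups
    # Hypothetical skew: experts in the first group receive a higher router score.
    bias = [2.0 if e < group_size else 0.0 for e in range(num_experts)]
    grouped, plain = [], []
    for _ in range(num_tokens):
        scores = softmax([rng.gauss(0.0, 1.0) + bias[e] for e in range(num_experts)])
        grouped.append(grouped_top_k(scores, num_groups, k_per_group))
        plain.append(global_top_k(scores, num_groups * k_per_group))
    return (device_load(grouped, num_groups, group_size),
            device_load(plain, num_groups, group_size))


if __name__ == "__main__":
    grouped_load, plain_load = simulate()
    print("grouped selection load per device:", grouped_load)
    print("global top-k load per device:   ", plain_load)
```

Under these assumptions, every token activates exactly `k_per_group` experts in each group, so the grouped load is identical across devices by construction, while global top-k concentrates work on the device holding the favored experts.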

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
cs.LG 2026-05 unverdicted novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models
cs.LG 2026-05 unverdicted novelty 6.0

Complete-muE combines active-width μP and activated-expert scaling to transfer hyperparameters across dense FFN, dense MoE, and sparse MoE while covering changes in experts, capacity, width, depth, batch size, and duration.

NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference
cs.AR 2026-05 unverdicted novelty 6.0

NASiC fuses CAM-based expert selection and multibit CIM computation in 3D NAND into one cycle for MoE LLM inference, claiming 4-114.8x performance and 3.9-70x energy efficiency gains over prior designs with high accuracy.

Hierarchical Mixture-of-Experts with Two-Stage Optimization
cs.LG 2026-05 unverdicted novelty 6.0

Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and v...

Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts
cs.LG 2026-05 unverdicted novelty 6.0

AIR-MoE introduces a two-stage inverted-index routing method based on vector quantization that approximates optimal expert selection for granular MoE models at lower cost and with empirical performance gains.

SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs
cs.CV 2026-04 unverdicted novelty 6.0

SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.

Intrinsic Fingerprint of LLMs: Continue Training is NOT All You Need to Steal A Model!
cs.CR 2025-07 unverdicted novelty 6.0

Standard deviation distributions of attention matrices in LLMs remain distinctive and stable after continued training, enabling fingerprinting to trace model lineage and detect potential plagiarism such as in Pangu Pro MoE.

Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression
cs.LG 2026-06 unverdicted novelty 5.0

Tensor decompositions face practical limits in large-scale LLM compression due to mismatch between assumed shared subspaces and heterogeneous model representations.