HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs
Pith reviewed 2026-05-25 02:47 UTC · model grok-4.3
The pith
HyperParallel-MoE reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58x on Ascend NPUs by turning serialized operators into a tile-level taskflow across matrix and vector units inside one kernel launch.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HyperParallel-MoE transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow spanning AIC and AIV resources. It introduces AIV-driven one-sided communication to eliminate host-side collective synchronization, dependency-preserving tile task generation to unify communication and computation under a common task abstraction, and event-driven static scheduling to coordinate cross-queue execution with low runtime overhead. The compiled taskflow executes within a unified runtime that concurrently drives AIC and AIV workers inside a single kernel launch, enabling fine-grained overlap among communication, matrix computation, and vector computation while preserving existing optimized operators.
What carries the argument
The tile-level heterogeneous taskflow spanning AIC matrix and AIV vector units, built from AIV-driven one-sided communication, dependency-preserving tile task generation, and event-driven static scheduling, executed inside a single kernel launch.
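To make the mechanism concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes each tile passes through three made-up stages (dispatch on AIV, a GEMM on AIC, combine on AIV) with invented costs, and it uses a simple greedy list scheduler as a stand-in for the paper's event-driven static scheduling. The names `TileTask`, `build_tile_tasks`, `static_schedule`, and `serialized_latency` are hypothetical and do not come from the paper or from any Ascend API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileTask:
    name: str
    queue: str        # "AIC" (matrix) or "AIV" (vector and communication)
    cost: int         # invented duration in arbitrary time units
    deps: tuple = ()  # tasks whose completion events must fire first


def build_tile_tasks(num_tiles, dispatch=2, gemm=3, combine=1):
    """Split three operators into per-tile tasks, keeping per-tile dependencies."""
    tasks = []
    for t in range(num_tiles):
        tasks.append(TileTask(f"dispatch{t}", "AIV", dispatch))
        tasks.append(TileTask(f"gemm{t}", "AIC", gemm, (f"dispatch{t}",)))
        tasks.append(TileTask(f"combine{t}", "AIV", combine, (f"gemm{t}",)))
    return tasks


def static_schedule(tasks):
    """Compute an offline schedule: repeatedly place the ready task that can start earliest."""
    finish = {}                          # task name -> completion time (its "event")
    queue_free = {"AIC": 0, "AIV": 0}    # time at which each queue becomes idle
    pending = list(tasks)
    timeline = []

    def earliest_start(task):
        # A task waits for its queue to be idle and for all dependency events.
        return max([queue_free[task.queue]] + [finish[d] for d in task.deps])

    while pending:
        ready = [t for t in pending if all(d in finish for d in t.deps)]
        task = min(ready, key=earliest_start)
        start = earliest_start(task)
        finish[task.name] = start + task.cost
        queue_free[task.queue] = finish[task.name]
        timeline.append((task.name, task.queue, start, finish[task.name]))
        pending.remove(task)
    return timeline, max(finish.values())


def serialized_latency(tasks):
    """Operator-by-operator execution: only one unit is busy at any time."""
    return sum(t.cost for t in tasks)


tasks = build_tile_tasks(num_tiles=4)
timeline, overlapped = static_schedule(tasks)
serial = serialized_latency(tasks)
for name, queue, start, end in timeline:
    print(f"{queue} {name:<10} [{start:>2}, {end:>2})")
print("serialized:", serial, "overlapped:", overlapped)
```

Because the costs are invented, the ratio this toy prints says nothing about the paper's measured results. It only shows why splitting operators into tiles lets the AIC queue work on one tile while the AIV queue dispatches or combines another.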
If this is right
- Dispatch-to-Combine MoE-FFN latency drops by up to 1.58x across multiple expert-parallel configurations.
- Fine-grained overlap occurs among communication, matrix computation, and vector computation.
- Existing optimized operators remain unchanged inside the unified runtime.
- The approach integrates into the MindSpore and MindFormers stack for practical MoE training.
Where Pith is reading between the lines
- The same tile-task abstraction could be applied to other MoE phases such as routing or all-to-all beyond Dispatch-to-Combine.
- Compiler-generated taskflows of this style might transfer to other NPUs that expose separate matrix and vector queues with event synchronization.
- If single-kernel-launch overhead stays low, the technique could shorten wall-clock time for full MoE pre-training runs without altering model architecture.
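How much a Dispatch-to-Combine speedup shortens a full training step depends on the share of step time spent in that phase. A standard way to estimate this is Amdahl's law. In the sketch below, f is a hypothetical fraction of step time spent in Dispatch-to-Combine (the reviewed text does not report it), and s = 1.58 is the best-case reduction quoted in the abstract.

```latex
S_{\text{step}} = \frac{1}{(1 - f) + \dfrac{f}{s}}, \qquad s = 1.58

% Illustration with an assumed f = 0.5 (not a figure from the paper):
S_{\text{step}} = \frac{1}{0.5 + \dfrac{0.5}{1.58}} \approx 1.22
```

Under that assumed fraction, the end-to-end gain would be much smaller than the phase-level gain, so any wall-clock claim for full pre-training needs the measured value of f.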
Load-bearing premise
The assumption that AIV-driven one-sided communication, dependency-preserving tile task generation, and event-driven static scheduling can be realized inside a single kernel launch without correctness issues or substantial runtime overhead while preserving existing optimized operators.
What would settle it
An experiment on Ascend A3 clusters running DeepSeek-style MoE models that measures either incorrect outputs from dependency violations or no net latency gain once single-kernel-launch overhead is included.
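The two failure conditions named above can be written as a small check. This is only a sketch under stated assumptions: `evaluate_claim` is a hypothetical helper, the tolerances are arbitrary choices, and the sample outputs and timings are invented placeholders rather than measurements from the paper.

```python
import math


def evaluate_claim(ref_out, new_out, baseline_ms, fused_ms,
                   launch_overhead_ms=0.0, rel_tol=1e-3, abs_tol=1e-5):
    """Check output agreement and net speedup once launch overhead is included."""
    outputs_match = len(ref_out) == len(new_out) and all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(ref_out, new_out)
    )
    speedup = baseline_ms / (fused_ms + launch_overhead_ms)
    return {
        "outputs_match": outputs_match,
        "speedup": speedup,
        # The claim survives only if results agree and there is a net gain.
        "claim_survives": outputs_match and speedup > 1.0,
    }


# Invented placeholder values, for illustration only.
ok = evaluate_claim([0.5, -1.25, 3.0], [0.5, -1.25, 3.0],
                    baseline_ms=10.0, fused_ms=7.0, launch_overhead_ms=1.0)
no_gain = evaluate_claim([0.5, -1.25, 3.0], [0.5, -1.25, 3.0],
                         baseline_ms=10.0, fused_ms=7.0, launch_overhead_ms=4.0)
print(ok)
print(no_gain)
```

In a real experiment the reference outputs would come from the serialized operator path, and timings would be averaged over many runs with reported variance.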
Original abstract
Modern Mixture-of-Experts (MoE) models increasingly rely on large-scale AI accelerator clusters for efficient training. Ascend NPUs expose heterogeneous on-chip compute resources, including matrix-oriented AIC units and vector-oriented AIV units with explicit cross-queue synchronization support. However, existing training frameworks largely execute MoE operators in a serialized kernel-by-kernel manner, leaving substantial heterogeneous parallelism underutilized. This paper presents HyperParallel-MoE, a compilation and scheduling framework for MoE training on Ascend NPUs. HyperParallel-MoE transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow spanning AIC and AIV resources. It introduces AIV-driven one-sided communication to eliminate host-side collective synchronization, dependency-preserving tile task generation to unify communication and computation under a common task abstraction, and event-driven static scheduling to coordinate cross-queue execution with low runtime overhead. HyperParallel-MoE further executes the compiled taskflow within a unified runtime that concurrently drives AIC and AIV workers inside a single kernel launch, enabling fine-grained overlap among communication, matrix computation, and vector computation while preserving existing optimized operators. We implement HyperParallel-MoE in the MindSpore and MindFormers stack and evaluate it using DeepSeek-style MoE models on Ascend A3 clusters. Across multiple expert-parallel configurations, HyperParallel-MoE reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58x, demonstrating that tile-level heterogeneous scheduling can substantially improve MoE training efficiency on modern NPUs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HyperParallel-MoE, a compilation and scheduling framework for Mixture-of-Experts (MoE) training on Ascend NPUs. It transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow that spans AIC matrix units and AIV vector units. Key techniques include AIV-driven one-sided communication to remove host-side synchronization, dependency-preserving tile task generation to unify communication and computation, and event-driven static scheduling for cross-queue coordination. The framework executes the taskflow in a unified runtime inside a single kernel launch to enable fine-grained overlap of communication, matrix, and vector computation while preserving existing optimized operators. It is implemented in the MindSpore/MindFormers stack and evaluated on DeepSeek-style MoE models on Ascend A3 clusters, reporting up to 1.58x reduction in Dispatch-to-Combine MoE-FFN latency across multiple expert-parallel configurations.
Significance. If the empirical results hold, the work is significant for demonstrating how to exploit heterogeneous on-chip resources (AIC/AIV) and explicit synchronization primitives on Ascend NPUs for MoE training, an increasingly important workload. The approach of compiling to a tile-level taskflow with static scheduling and single-kernel execution offers a concrete method to improve efficiency without altering existing high-performance operators. This is relevant to the distributed systems and high-performance computing community working on accelerator-specific optimizations for large models.
minor comments (1)
- The abstract states a 1.58x latency reduction, but the provided text does not include the experimental section. The manuscript should ensure that the evaluation section supplies full baselines, configurations, error bars, and data-exclusion criteria so the central claim can be verified.
Simulated Author's Rebuttal
We thank the referee for their summary of the manuscript and for recognizing the potential significance of tile-level heterogeneous scheduling for MoE training on Ascend NPUs. The recommendation is listed as uncertain, but the report contains no specific major comments to address.
Circularity Check
No significant circularity
The paper describes an engineering framework for tile-level heterogeneous scheduling of MoE operators on Ascend NPUs and reports measured latency reductions (up to 1.58x) from experiments on DeepSeek-style models. No mathematical derivation chain, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text or abstract. The central claim is an empirical outcome of the implemented scheduling techniques rather than a result that reduces to its own inputs by construction.