arxiv: 2605.23764 · v1 · pith:IMOO45BQnew · submitted 2026-05-22 · 💻 cs.DC

HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs

Pith reviewed 2026-05-25 02:47 UTC · model grok-4.3

classification 💻 cs.DC
keywords Mixture-of-Experts · Ascend NPU · heterogeneous scheduling · MoE training · tile-level taskflow · expert parallelism · AIV communication

The pith

HyperParallel-MoE reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58x on Ascend NPUs by turning serialized operators into a tile-level taskflow across matrix and vector units inside one kernel launch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HyperParallel-MoE to address underutilized heterogeneous resources on Ascend NPUs, where matrix-oriented AIC units and vector-oriented AIV units sit idle during serialized MoE kernel execution. It converts operator-level MoE work into a statically scheduled tile-level taskflow that unifies communication and computation under one abstraction. Three techniques enable this: AIV-driven one-sided communication that removes host-side collectives, dependency-preserving tile task generation, and event-driven static scheduling for cross-queue coordination. The entire taskflow then runs concurrently on AIC and AIV workers from a single kernel launch, preserving existing optimized operators. Evaluation on DeepSeek-style models across several expert-parallel configurations on Ascend A3 clusters reports up to a 1.58x reduction in Dispatch-to-Combine MoE-FFN latency.

Core claim

HyperParallel-MoE transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow spanning AIC and AIV resources. It introduces AIV-driven one-sided communication to eliminate host-side collective synchronization, dependency-preserving tile task generation to unify communication and computation under a common task abstraction, and event-driven static scheduling to coordinate cross-queue execution with low runtime overhead. The compiled taskflow executes within a unified runtime that concurrently drives AIC and AIV workers inside a single kernel launch, enabling fine-grained overlap among communication, matrix computation, and vector computation while preserving existing optimized operators.

What carries the argument

The tile-level heterogeneous taskflow spanning AIC matrix and AIV vector units, built from AIV-driven one-sided communication, dependency-preserving tile task generation, and event-driven static scheduling, executed inside a single kernel launch.
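To make the idea of a tile-level taskflow concrete, here is a minimal Python sketch, not the paper's compiler or runtime. It splits a Dispatch, GMM up-projection, SwiGLU, GMM down-projection, Combine chain into per-tile tasks on two queues and places them with a simple greedy list scheduler. The operator costs, the AIC/AIV mapping in `STAGES`, and the names `make_tile_tasks` and `schedule` are illustrative assumptions; the greedy policy is only a stand-in for the paper's event-driven static scheduling.

```python
from dataclasses import dataclass, field

# Hypothetical per-operator costs in arbitrary time units (not measured values).
COSTS = {"dispatch": 2.0, "gmm_up": 4.0, "swiglu": 1.0, "gmm_down": 4.0, "combine": 2.0}
# Assumed mapping: communication and vector work on AIV, matrix work on AIC.
STAGES = [("dispatch", "AIV"), ("gmm_up", "AIC"), ("swiglu", "AIV"),
          ("gmm_down", "AIC"), ("combine", "AIV")]


@dataclass
class Task:
    name: str
    queue: str                      # "AIC" or "AIV"
    cost: float
    deps: list = field(default_factory=list)


def make_tile_tasks(num_tiles):
    """Split each operator into num_tiles tile tasks, keeping the per-tile chain."""
    tasks = []
    for tile in range(num_tiles):
        prev = None
        for stage, queue in STAGES:
            name = f"{stage}[{tile}]"
            deps = [prev] if prev else []
            tasks.append(Task(name, queue, COSTS[stage] / num_tiles, deps))
            prev = name
    return tasks


def schedule(tasks):
    """Greedy list scheduling: one task at a time per queue, dependencies respected."""
    finish = {}
    queue_free = {"AIC": 0.0, "AIV": 0.0}
    pending = list(tasks)
    plan = []

    def earliest_start(task):
        return max([queue_free[task.queue]] + [finish[d] for d in task.deps])

    while pending:
        ready = [t for t in pending if all(d in finish for d in t.deps)]
        task = min(ready, key=earliest_start)   # ties resolve in list order
        start = earliest_start(task)
        finish[task.name] = start + task.cost
        queue_free[task.queue] = finish[task.name]
        plan.append((task.name, task.queue, start, finish[task.name]))
        pending.remove(task)
    return plan, max(finish.values())


# One tile per operator models kernel-by-kernel execution; four tiles allow overlap.
serial_plan, serial_time = schedule(make_tile_tasks(1))
tiled_plan, tiled_time = schedule(make_tile_tasks(4))
```

With a single tile the chain runs strictly in order, so one queue is always idle; with several tiles the scheduler can run a matrix tile on AIC while AIV handles communication or SwiGLU for another tile. The toy numbers say nothing about real hardware and are only meant to show where the overlap comes from.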

If this is right

  • Dispatch-to-Combine MoE-FFN latency drops by up to 1.58x across multiple expert-parallel configurations.
  • Fine-grained overlap occurs among communication, matrix computation, and vector computation.
  • Existing optimized operators remain unchanged inside the unified runtime.
  • The approach integrates into the MindSpore and MindFormers stack for practical MoE training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tile-task abstraction could be applied to MoE phases outside the Dispatch-to-Combine span, such as routing, or to other all-to-all exchanges.
  • Compiler-generated taskflows of this style might transfer to other NPUs that expose separate matrix and vector queues with event synchronization.
  • If single-kernel-launch overhead stays low, the technique could shorten wall-clock time for full MoE pre-training runs without altering model architecture.

Load-bearing premise

The assumption that AIV-driven one-sided communication, dependency-preserving tile task generation, and event-driven static scheduling can be realized inside a single kernel launch without correctness issues or substantial runtime overhead while preserving existing optimized operators.

What would settle it

An experiment on Ascend A3 clusters running DeepSeek-style MoE models that measures either incorrect outputs from dependency violations or no net latency gain once single-kernel-launch overhead is included.
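The two failure modes named above can be phrased as simple checks. The sketch below is an assumed test harness, not something the paper provides: `find_violations` flags tasks that start before a prerequisite has finished or that overlap on the same queue, and `net_speedup` charges a launch overhead against the tiled latency. All names and the sample timings are hypothetical.

```python
def find_violations(plan, deps):
    """plan: list of (name, queue, start, end); deps: {name: [prerequisite names]}."""
    by_name = {name: (start, end) for name, _, start, end in plan}
    problems = []
    # A task must not start before each of its prerequisites has finished.
    for name, prereqs in deps.items():
        for prereq in prereqs:
            if by_name[prereq][1] > by_name[name][0]:
                problems.append(f"{name} starts before {prereq} finishes")
    # Tasks placed on the same queue must not overlap in time.
    per_queue = {}
    for name, queue, start, end in plan:
        per_queue.setdefault(queue, []).append((start, end, name))
    for queue, items in per_queue.items():
        items.sort()
        for (_, end0, name0), (start1, _, name1) in zip(items, items[1:]):
            if start1 < end0:
                problems.append(f"{name0} and {name1} overlap on {queue}")
    return problems


def net_speedup(baseline, tiled, launch_overhead):
    """Speedup once the single-kernel launch overhead is charged to the tiled run."""
    return baseline / (tiled + launch_overhead)


deps = {"gmm[0]": ["dispatch[0]"], "gmm[1]": ["dispatch[1]"]}
good_plan = [("dispatch[0]", "AIV", 0.0, 1.0), ("gmm[0]", "AIC", 1.0, 3.0),
             ("dispatch[1]", "AIV", 1.0, 2.0), ("gmm[1]", "AIC", 3.0, 5.0)]
# Same tasks, but gmm[1] is started too early on purpose.
bad_plan = good_plan[:3] + [("gmm[1]", "AIC", 1.5, 3.5)]
```

A schedule check like this only covers ordering; a real experiment would also compare numerical outputs against the operator-by-operator baseline.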

Figures

Figures reproduced from arXiv: 2605.23764 by Cheng Li, Congkun Ai, Da Lei, Guangpeng Zhang, Hanbo Zhang, Haoran Wang, Shihan Xiao, Teng Su, Xuefeng Jin, Zewen Jin.

Figure 1
Figure 1. Ascend NPU heterogeneous AIC/AIV execution model.
Excerpt from the surrounding text: … are resolved offline. We integrate HyperParallel-MoE into the MindSpore and MindFormers training stack [16, 17] with low code intrusion, while preserving existing optimized implementations of GMM, SwiGLU, and communication operators. We evaluate HyperParallel-MoE using DeepSeek-V3-style MoE models [7] on clusters of Ascend A3 NPUs. Across EP4, EP8, and EP…
Figure 2
Figure 2. Forward and backward MoE-FFN operator graph with AIC/AIV mapping.
Excerpt from the surrounding text: … to form the final MoE output. Representative MoE models include DeepSeek-V2 [6], DeepSeek-V3 [7], Mixtral 8×7B [13], and Qwen2.5-MoE [19]. To better support Mixture-of-Experts (MoE) training on A3 NPUs, we examine its computational structure in depth. Consider the MoE feed-forward network (MoE-FFN) as a representative example. Its forward …
Figure 3
Figure 3. End-to-end training step time breakdown on Ascend A3.
Extracted diagram labels: panels (a) Kernel-by-Kernel Execution and (b) Tile-Level AIC/AIV Pipeline; rows Cube and Vector; legend Dispatch / Combine, GMM_gate / up, GMM_down, SwiGLU, Idle. These labels appear to belong to the pipeline diagram captioned as Figure 4.
Figure 4
Figure 4. Kernel-by-kernel execution versus tile-level AIC/AIV pipelining.
Excerpt from the surrounding text: After SwiGLU_grad, GMM_gate_grad and GMM_w1_grad become independent consumers; backward Combine then returns the resulting input activation gradient [7, 16]. These operators stress different hardware resources. GMM operators mainly use Cube matrix engines, whereas Dispatch, Combine, SwiGLU, activation gradients, and data movement map mostly to …
Figure 5
Figure 5. Overview of HyperParallel-MoE.
Excerpt from the surrounding text: … decomposes them into fine-grained tile tasks and organizes these tasks into concurrent execution streams across heterogeneous hardware queues. At a high level, HyperParallel-MoE shifts MoE execution from a kernel-centric model to a taskflow-centric model. During compilation, the framework analyzes operator dependencies, legal tiling strategies, tensor layouts, and hardware…
Figure 6
Figure 6. Rank-Aware Task Reordering (RATR). The naive order creates destination-rank hotspots, while RATR rotates each rank's task order to form a balanced communication pattern.
Excerpt from the surrounding text: … both the activation-gradient GMM and the down-projection weight-gradient GMM consume the dispatched expert activations without depending on each other. If the scheduler executes one GMM branch in its entirety before launching the other, …
Figure 8
Figure 8. End-to-end latency for one training step with sampled natural routing. Bar annotations report total step-level speedup over the standard operator-by-operator baseline.
Extracted panel label: Balanced routing.
Figure 7
Figure 7. Forward/backward Dispatch-to-Combine latency breakdown under balanced routing. Bar annotations report total speedup over the standard operator-by-operator baseline.
Excerpt from the surrounding text: … execution path with full-device operators, full-core exclusive execution, and collective AllToAll communication. For end-to-end step latency, the baseline also retains MindSpore's DVM-level automatic fusion and graph-level execution planning, …
Figure 9
Figure 9. SwiGLU+Add cache microbenchmarks under serial and tile-interleaved execution. Left: execution latency. Right: L2 cache hit rate.
Excerpt from the surrounding text (Section 6, Microbenchmarks): Section 5 reports both Dispatch-to-Combine MoE-FFN module latency and end-to-end training-step latency after communication, computation, synchronization, and ordering optimizations are applied together. This section complements that evaluation with focused mi…
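The Figure 6 caption says that RATR rotates each rank's task order so that destination ranks are not hit by every sender at once. The caption does not give the exact rule, so the sketch below assumes a plain rotation in which rank r starts sending at destination r. The function names are hypothetical.

```python
from collections import Counter


def naive_order(rank, world_size):
    """Every rank walks destinations 0, 1, 2, ... in the same order."""
    return list(range(world_size))


def rotated_order(rank, world_size):
    """Assumed rotation: rank r starts at destination r and wraps around."""
    return [(rank + step) % world_size for step in range(world_size)]


def max_senders_per_step(order_fn, world_size):
    """Largest number of ranks that target one destination in the same step."""
    worst = 0
    for step in range(world_size):
        hits = Counter(order_fn(rank, world_size)[step] for rank in range(world_size))
        worst = max(worst, max(hits.values()))
    return worst


hotspot = max_senders_per_step(naive_order, 8)     # all 8 ranks hit one destination
balanced = max_senders_per_step(rotated_order, 8)  # one sender per destination
```

In the naive order every rank targets the same destination at each step, while the rotation gives each destination exactly one sender per step. Whether the paper's RATR uses this exact offset is not stated in the text available here.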
Original abstract

Modern Mixture-of-Experts (MoE) models increasingly rely on large-scale AI accelerator clusters for efficient training. Ascend NPUs expose heterogeneous on-chip compute resources, including matrix-oriented AIC units and vector-oriented AIV units with explicit cross-queue synchronization support. However, existing training frameworks largely execute MoE operators in a serialized kernel-by-kernel manner, leaving substantial heterogeneous parallelism underutilized. This paper presents HyperParallel-MoE, a compilation and scheduling framework for MoE training on Ascend NPUs. HyperParallel-MoE transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow spanning AIC and AIV resources. It introduces AIV-driven one-sided communication to eliminate host-side collective synchronization, dependency-preserving tile task generation to unify communication and computation under a common task abstraction, and event-driven static scheduling to coordinate cross-queue execution with low runtime overhead. HyperParallel-MoE further executes the compiled taskflow within a unified runtime that concurrently drives AIC and AIV workers inside a single kernel launch, enabling fine-grained overlap among communication, matrix computation, and vector computation while preserving existing optimized operators. We implement HyperParallel-MoE in the MindSpore and MindFormers stack and evaluate it using DeepSeek-style MoE models on Ascend A3 clusters. Across multiple expert-parallel configurations, HyperParallel-MoE reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58x, demonstrating that tile-level heterogeneous scheduling can substantially improve MoE training efficiency on modern NPUs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper introduces HyperParallel-MoE, a compilation and scheduling framework for Mixture-of-Experts (MoE) training on Ascend NPUs. It transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow that spans AIC matrix units and AIV vector units. Key techniques include AIV-driven one-sided communication to remove host-side synchronization, dependency-preserving tile task generation to unify communication and computation, and event-driven static scheduling for cross-queue coordination. The framework executes the taskflow in a unified runtime inside a single kernel launch to enable fine-grained overlap of communication, matrix, and vector computation while preserving existing optimized operators. It is implemented in the MindSpore/MindFormers stack and evaluated on DeepSeek-style MoE models on Ascend A3 clusters, reporting up to 1.58x reduction in Dispatch-to-Combine MoE-FFN latency across multiple expert-parallel configurations.

Significance. If the empirical results hold, the work is significant for demonstrating how to exploit heterogeneous on-chip resources (AIC/AIV) and explicit synchronization primitives on Ascend NPUs for MoE training, an increasingly important workload. The approach of compiling to a tile-level taskflow with static scheduling and single-kernel execution offers a concrete method to improve efficiency without altering existing high-performance operators. This is relevant to the distributed systems and high-performance computing community working on accelerator-specific optimizations for large models.

minor comments (1)
  1. The abstract states a 1.58x latency reduction, but the provided text does not include the experimental section. The manuscript should ensure that the evaluation section supplies full baselines, configurations, error bars, and data-exclusion criteria so that the central claim can be verified.
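As one way to report the error bars this comment asks for, the sketch below computes a speedup with a percentile bootstrap interval from repeated latency measurements. It is an illustration of a possible reporting method, not the paper's methodology; the function name `speedup_with_ci` and the sample latencies are made up and are not data from the paper.

```python
import random
import statistics


def speedup_with_ci(baseline_ms, optimized_ms, n_boot=2000, seed=0):
    """Return (point estimate, lower, upper) for mean(baseline) / mean(optimized)."""
    rng = random.Random(seed)
    point = statistics.mean(baseline_ms) / statistics.mean(optimized_ms)
    samples = []
    for _ in range(n_boot):
        # Resample each run set with replacement and recompute the ratio.
        b = rng.choices(baseline_ms, k=len(baseline_ms))
        o = rng.choices(optimized_ms, k=len(optimized_ms))
        samples.append(statistics.mean(b) / statistics.mean(o))
    samples.sort()
    lower = samples[int(0.025 * n_boot)]
    upper = samples[int(0.975 * n_boot) - 1]
    return point, lower, upper


# Placeholder latencies in milliseconds, invented for illustration only.
baseline_runs = [10.0, 10.2, 9.8, 10.1]
optimized_runs = [6.4, 6.5, 6.3, 6.6]
point, lower, upper = speedup_with_ci(baseline_runs, optimized_runs)
```

Reporting the interval next to the headline ratio makes it easier to judge whether a figure such as "up to 1.58x" is stable across runs and configurations.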

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript and for recognizing the potential significance of tile-level heterogeneous scheduling for MoE training on Ascend NPUs. The recommendation is listed as uncertain, but the report contains no specific major comments to address.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an engineering framework for tile-level heterogeneous scheduling of MoE operators on Ascend NPUs and reports measured latency reductions (up to 1.58x) from experiments on DeepSeek-style models. No mathematical derivation chain, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text or abstract. The central claim is an empirical outcome of the implemented scheduling techniques rather than a result that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes an engineering framework that relies on existing hardware features of Ascend NPUs and prior optimized operators; no free parameters, new axioms, or invented entities are introduced or fitted.

pith-pipeline@v0.9.0 · 5838 in / 1062 out tokens · 17580 ms · 2026-05-25T02:47:29.580812+00:00 · methodology

Reference graph

Works this paper leans on

30 extracted references · 15 canonical work pages · 9 internal anchors

  1. [1]

    Osayamen Aimuyo, Byungsoo Oh, and Rachee Singh. 2025. FlashMoE: Fast Distributed MoE in a Single Kernel. In Advances in Neural Information Processing Systems, D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen (Eds.), Vol. 38. Curran Associates, Inc., Red Hook, NY, USA, 100676–100699. https://proceedings.neurips.cc/paper_file...

  2. [2]

    Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, and Xin Liu. 2024. FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion. arXiv:2406.06858 [cs.LG] https://arxiv.org/abs/2406.06858

  3. [3]

    Yu Cheng, Lei Wang, Yining Shi, Yuqing Xia, Lingxiao Ma, Jilong Xue, Yang Wang, Zhiwen Mo, Feiyang Chen, Fan Yang, Mao Yang, and Zhi Yang. 2025. PipeThreader: Software-Defined Pipelining for Efficient DNN Execution. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). USENIX Association, Boston, MA, 767–783. https://www.usen...

  4. [4]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    Damai Dai, Chengqi Deng, Chenggang Zhao, R.X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, et al. 2024. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv:2401.06066 [cs.CL] https://arxiv.org/abs/2401.06066

  5. [5]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems, Vol. 35. Curran Associates, Inc., Red Hook, NY, USA, 16344–16359. https://arxiv.org/abs/2205.14135

  6. [6]

    DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434 [cs.CL] https://arxiv.org/abs/2405.04434

  7. [7]

    DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL] https://arxiv.org/abs/2412.19437

  8. [8]

    DeepSeek-AI. 2025. DeepEP. https://github.com/deepseek-ai/DeepEP

  9. [9]

    DeepSeek-AI. 2026. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. Technical report. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. Accessed May 19, 2026

  10. [10]

    DeepSeek-AI. 2026. MegaMoE. https://github.com/deepseek-ai/DeepGEMM/pull/304. Merged Apr. 17, 2026

  11. [11]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39. http://jmlr.org/papers/v23/21-0998.html

  12. [12]

    Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Jie Meng, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, and Yizhou Shan. 2025. DEEPSERVE: Serverless Large Language Model Serving at Scale. In 2025 USENIX Annual Technical Conference ...

  13. [13]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...

  14. [14]

    Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. 2024. Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping. In Proceedings of Machine Learning and Systems, Vol. 6. MLSys, Santa Clara, CA, USA, 13 pages. arXiv:2404.19429 [cs.DC] https://proceedings.mlsys.org/paper_files/paper/2024/file...

  15. [15]

    Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J...

  16. [16]

    MindSpore Contributors. 2020. MindSpore. https://www.mindspore.cn/

  17. [17]

    MindSpore Contributors. 2024. MindSpore Transformers. https://www.mindspore.cn/mindformers/docs/en/master/mindformers.html

  18. [18]

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. Association for Computing Machinery, Huntsville, ON, Canada, 15 pages. doi:10.1145/3...

  19. [19]

    Qwen Team. 2024. Qwen2.5 Technical Report. arXiv:2412.15115 [cs.CL] https://arxiv.org/abs/2412.15115

  20. [20]

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, Toulo...

  21. [21]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL] https://arxiv.org/abs/1909.08053

  22. [22]

    Haiquan Wang, Chaoyi Ruan, Jia He, Jiaqi Ruan, Chengjie Tang, Xiaosong Ma, and Cheng Li. 2025. DHeLlam: General-Purpose, Automatic Micro-Batch Co-Execution for Distributed LLM Training. In 2025 IEEE 43rd International Conference on Computer Design (ICCD). 70–78. doi:10.1109/ICCD65941.2025.00017

  23. [23]

    Jinwu Yang, Jiaan Wu, Zedong Liu, Xinyang Ma, Hairui Zhao, Yida Gu, Yuanhong Huang, Xingchen Liu, Wenjing Huang, Zheng Wei, Jing Xing, Yili Ma, Qingyi Zhang, Baoyi An, Zhongzhe Hu, Shaoteng Liu, Xia Zhu, Jiaxun Lu, Guangming Tan, and Dingwen Tao. 2026. ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs. arXiv:2604.03298 [c...

  24. [24]

    Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, and Xin Liu. 2025. COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts. In Proceedings of Machine Learning and Systems, M. Zaharia, G. Joshi, and Y. Lin (Eds.), Vol. 7. MLSys, Santa C...

  25. [25]

    Chenggang Zhao, Zhean Xu, Liang Zhao, Jiashi Li, Chenhao Xu, Anyi Xu, Shengyu Liu, Kexing Zhou, and Kuai Yu. 2025. DeepGEMM: clean and efficient BLAS kernel library on GPU. https://github.com/deepseek-ai/DeepGEMM

  26. [26]

    Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, Yifan Guo, Ningxin Zheng, Ziheng Jiang, Xinyi Di, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, and Xin Liu. 2025. Triton-distributed: Programming Overlapping Kernels on Distributed AI Syst...

  27. [27]

    Size Zheng, Xuegui Zheng, Li-wen Chang, and Jidong Zhai. 2026. UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training. arXiv:2604.19241 [cs.DC] https://arxiv.org/abs/2604.19241

  28. [28]

    Yuhang Zhou, Zhibin Wang, Guyue Liu, Shipeng Li, Xi Lin, Zibo Wang, Yongzhong Wang, Fuchun Wei, Jingyi Zhang, Zhiheng Hu, Yanlin Liu, Chunsheng Li, Ziyang Zhang, Yaoyuan Wang, Bin Zhou, Wanchun Dou, Guihai Chen, and Chen Tian. 2025. Squeezing Operator Performance Potential for the Ascend Architecture. In Proceedings of the 30th ACM International Conference...

  29. [29]

    Yuhang Zhou, Zibo Wang, Zhibin Wang, Ruyi Zhang, Chen Tian, Xiaoliang Wang, Wanchun Dou, Guihai Chen, Bingqiang Wang, Yonghong Tian, Yan Zhang, Hui Wang, Fuchun Wei, Boquan Sun, Jingyi Zhang, Bin She, Teng Su, Yifan Yao, Chunsheng Li, Ziyang Zhang, Yaoyuan Wang, Bin Zhou, and Guyue Liu. 2025. Accelerating Model Training on Ascend Chips: An Industrial Syst...