HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs

Cheng Li; Congkun Ai; Da Lei; Guangpeng Zhang; Hanbo Zhang; Haoran Wang; Shihan Xiao; Teng Su; Xuefeng Jin; Zewen Jin

arxiv: 2605.23764 · v1 · pith:IMOO45BQnew · submitted 2026-05-22 · 💻 cs.DC


Pith reviewed 2026-05-25 02:47 UTC · model grok-4.3

classification 💻 cs.DC

keywords: Mixture-of-Experts · Ascend NPU · heterogeneous scheduling · MoE training · tile-level taskflow · expert parallelism · AIV communication


The pith

HyperParallel-MoE reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58x on Ascend NPUs by turning serialized operators into a tile-level taskflow across matrix and vector units inside one kernel launch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HyperParallel-MoE to address underutilized heterogeneous resources on Ascend NPUs, where matrix-oriented AIC units and vector-oriented AIV units sit idle during serialized MoE kernel execution. It converts operator-level MoE work into a statically scheduled tile-level taskflow that unifies communication and computation under one abstraction. Three techniques enable this: AIV-driven one-sided communication that removes host-side collectives, dependency-preserving tile task generation, and event-driven static scheduling for cross-queue coordination. The entire taskflow then runs concurrently on AIC and AIV workers from a single kernel launch, preserving existing optimized operators. Evaluation on DeepSeek-style models across expert-parallel setups on Ascend A3 clusters shows Dispatch-to-Combine MoE-FFN latency reduced by up to 1.58x.

Core claim

HyperParallel-MoE transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow spanning AIC and AIV resources. It introduces AIV-driven one-sided communication to eliminate host-side collective synchronization, dependency-preserving tile task generation to unify communication and computation under a common task abstraction, and event-driven static scheduling to coordinate cross-queue execution with low runtime overhead. The compiled taskflow executes within a unified runtime that concurrently drives AIC and AIV workers inside a single kernel launch, enabling fine-grained overlap among communication, matrix computation, and vector computation while preserving existing optimized operators.
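To make "dependency-preserving tile task generation" concrete, here is a minimal Python sketch. It is not the paper's implementation: the `TileTask` type, the operator names, the queue labels, and the assumption that tile i of each operator depends only on tile i of the preceding operator are all illustrative choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileTask:
    op: str            # operator the tile belongs to (hypothetical names)
    tile: int          # tile index within that operator
    queue: str         # "AIC" (matrix units) or "AIV" (vector units)
    deps: tuple = ()   # names of tasks that must finish before this one

    @property
    def name(self):
        return f"{self.op}[{self.tile}]"


# Forward Dispatch-to-Combine chain and the unit each operator is assumed to
# map to: GMM on matrix units, everything else on vector units.
FORWARD_CHAIN = [
    ("dispatch", "AIV"),
    ("gmm_gate_up", "AIC"),
    ("swiglu", "AIV"),
    ("gmm_down", "AIC"),
    ("combine", "AIV"),
]


def generate_tile_tasks(num_tiles, chain=FORWARD_CHAIN):
    """Split each operator into num_tiles tasks, keeping per-tile dependencies.

    Assumption for this sketch: tile t of an operator depends only on tile t
    of the operator before it in the chain.
    """
    tasks = []
    for t in range(num_tiles):
        prev = None
        for op, queue in chain:
            deps = (prev,) if prev else ()
            task = TileTask(op, t, queue, deps)
            tasks.append(task)
            prev = task.name
    return tasks


if __name__ == "__main__":
    for task in generate_tile_tasks(2):
        print(task.name, task.queue, task.deps)
```

Because communication tiles (`dispatch`, `combine`) and compute tiles share one task type, a scheduler can treat them uniformly, which is the point of the common task abstraction described above.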

What carries the argument

The tile-level heterogeneous taskflow spanning AIC matrix and AIV vector units, built from AIV-driven one-sided communication, dependency-preserving tile task generation, and event-driven static scheduling, executed inside a single kernel launch.
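The event-driven part of this design can be illustrated with a small host-side simulation. The sketch below is an analogy only: Python threads stand in for AIC and AIV workers, `threading.Event` stands in for the hardware cross-queue events, and the task names and per-queue orders are invented for the example. The per-queue order is fixed before execution (the "static" part); at run time each worker only waits on and signals events.

```python
import threading


def run_static_schedule(queues, deps):
    """Run a precomputed per-queue task order with event-based waits.

    queues: {queue_name: [task, ...]}  fixed issue order for each worker
    deps:   {task: [producer tasks]}   cross-queue or same-queue dependencies
    Returns the order in which tasks actually executed.
    """
    events = {t: threading.Event() for order in queues.values() for t in order}
    log, lock = [], threading.Lock()

    def worker(order):
        for task in order:
            for producer in deps.get(task, ()):
                events[producer].wait()   # block until the producer signals
            with lock:
                log.append(task)          # stand-in for executing the tile
            events[task].set()            # signal any waiting consumers

    threads = [threading.Thread(target=worker, args=(order,))
               for order in queues.values()]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return log


if __name__ == "__main__":
    queues = {
        "AIV": ["dispatch0", "dispatch1", "swiglu0", "swiglu1"],
        "AIC": ["gmm0", "gmm1"],
    }
    deps = {
        "gmm0": ["dispatch0"], "gmm1": ["dispatch1"],
        "swiglu0": ["gmm0"], "swiglu1": ["gmm1"],
    }
    print(run_static_schedule(queues, deps))
```

The exact interleaving varies between runs, but every task appears in the log after all of its producers. A static order that contradicts the dependencies would deadlock in this sketch, which is why such orders have to be validated when the schedule is built rather than at run time.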

If this is right

Dispatch-to-Combine MoE-FFN latency drops by up to 1.58x across multiple expert-parallel configurations.
Fine-grained overlap occurs among communication, matrix computation, and vector computation.
Existing optimized operators remain unchanged inside the unified runtime.
The approach integrates into the MindSpore and MindFormers stack for practical MoE training.
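Why tiling can shorten the Dispatch-to-Combine path is easy to see in a toy timing model. The sketch below is an illustration under stated assumptions, not a reproduction of the paper's measurements: the operator costs are made-up numbers, each unit runs one task at a time in issue order, and tiling is assumed to split an operator's cost evenly with no overhead.

```python
# (operator, unit, cost in arbitrary time units); all costs are invented.
CHAIN = [
    ("dispatch", "AIV", 4.0),
    ("gmm_gate_up", "AIC", 4.0),
    ("swiglu", "AIV", 2.0),
    ("gmm_down", "AIC", 4.0),
    ("combine", "AIV", 4.0),
]


def build(num_tiles):
    """Operator-major issue order; tile t depends on tile t of the previous op."""
    tasks = []
    for i, (op, unit, cost) in enumerate(CHAIN):
        for t in range(num_tiles):
            deps = [f"{CHAIN[i - 1][0]}[{t}]"] if i else []
            tasks.append((f"{op}[{t}]", unit, cost / num_tiles, deps))
    return tasks


def makespan(tasks):
    """Finish time when each unit executes its tasks in list order."""
    unit_free, finish = {}, {}
    for name, unit, duration, deps in tasks:
        start = max([unit_free.get(unit, 0.0)] + [finish[d] for d in deps])
        finish[name] = start + duration
        unit_free[unit] = finish[name]
    return max(finish.values())


if __name__ == "__main__":
    serial = makespan(build(1))   # kernel-by-kernel: one tile per operator
    tiled = makespan(build(4))    # four tiles per operator, AIC/AIV overlap
    print(serial, tiled, serial / tiled)
```

With these invented costs the kernel-by-kernel order finishes at 18 time units and the four-tile order at 10. The ratio says nothing about the paper's reported 1.58x; it only shows the mechanism by which idle time on one unit is filled with tiles from another operator.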

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tile-task abstraction could be applied to other MoE phases such as routing or all-to-all beyond Dispatch-to-Combine.
Compiler-generated taskflows of this style might transfer to other NPUs that expose separate matrix and vector queues with event synchronization.
If single-kernel-launch overhead stays low, the technique could shorten wall-clock time for full MoE pre-training runs without altering model architecture.

Load-bearing premise

The assumption that AIV-driven one-sided communication, dependency-preserving tile task generation, and event-driven static scheduling can be realized inside a single kernel launch without correctness issues or substantial runtime overhead while preserving existing optimized operators.

What would settle it

An experiment on Ascend A3 clusters running DeepSeek-style MoE models that measures either incorrect outputs from dependency violations or no net latency gain once single-kernel-launch overhead is included.

Figures

Figures reproduced from arXiv: 2605.23764 by Cheng Li, Congkun Ai, Da Lei, Guangpeng Zhang, Hanbo Zhang, Haoran Wang, Shihan Xiao, Teng Su, Xuefeng Jin, Zewen Jin.

**Figure 1.** Ascend NPU heterogeneous AIC/AIV execution model.

Adjacent text from the paper: … are resolved offline. We integrate HyperParallel-MoE into the MindSpore and MindFormers training stack [16, 17] with low code intrusion, while preserving existing optimized implementations of GMM, SwiGLU, and communication operators. We evaluate HyperParallel-MoE using DeepSeek-V3-style MoE models [7] on clusters of Ascend A3 NPUs. Across EP4, EP8, and EP…

**Figure 2.** Forward and backward MoE-FFN operator graph with AIC/AIV mapping.

Adjacent text from the paper: … to form the final MoE output. Representative MoE models include DeepSeek-V2 [6], DeepSeek-V3 [7], Mixtral 8×7B [13], and Qwen2.5-MoE [19]. To better support Mixture-of-Experts (MoE) training on A3 NPUs, we examine its computational structure in depth. Consider the MoE feed-forward network (MoE-FFN) as a representative example. Its forward …

**Figure 3.** End-to-end training step time breakdown on Ascend A3.

[Text recovered from the image: panels (a) Kernel-by-Kernel Execution and (b) Tile-Level AIC/AIV Pipeline; lanes Cube and Vector; task labels Dispatch, Combine, GMM_gate/up, GMM_down, SwiGLU, Idle.]

**Figure 4.** Kernel-by-kernel execution versus tile-level AIC/AIV pipelining.

Adjacent text from the paper: … After SwiGLU_grad, GMM_gate_grad and GMM_w1_grad become independent consumers; backward Combine then returns the resulting input activation gradient [7, 16]. These operators stress different hardware resources. GMM operators mainly use Cube matrix engines, whereas Dispatch, Combine, SwiGLU, activation gradients, and data movement map mostly to …

**Figure 5.** Overview of HyperParallel-MoE.

Adjacent text from the paper: … decomposes them into fine-grained tile tasks and organizes these tasks into concurrent execution streams across heterogeneous hardware queues. At a high level, HyperParallel-MoE shifts MoE execution from a kernel-centric model to a taskflow-centric model. During compilation, the framework analyzes operator dependencies, legal tiling strategies, tensor layouts, and hardware…

**Figure 6.** Rank-Aware Task Reordering (RATR). The naive order creates destination-rank hotspots, while RATR rotates each rank's task order to form a balanced communication pattern.

Adjacent text from the paper: … both the activation-gradient GMM and the down-projection weight-gradient GMM consume the dispatched expert activations without depending on each other. If the scheduler executes one GMM branch in its entirety before launching the other, …
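The hotspot effect described in the Figure 6 caption can be shown with a few lines of Python. This is a sketch under an assumption: the rotation is modeled as a plain cyclic shift by rank index, which is one simple way to "rotate each rank's task order"; the paper's actual RATR rule may differ. The function names are hypothetical.

```python
from collections import Counter


def naive_order(rank, world):
    """Every rank sends to destinations 0, 1, ..., world-1 in the same order."""
    return list(range(world))


def rotated_order(rank, world):
    """Each rank starts at its own index and wraps around (assumed rotation)."""
    return [(rank + step) % world for step in range(world)]


def max_fan_in(order_fn, world):
    """Largest number of ranks that target one destination in the same step."""
    worst = 0
    for step in range(world):
        hits = Counter(order_fn(rank, world)[step] for rank in range(world))
        worst = max(worst, max(hits.values()))
    return worst


if __name__ == "__main__":
    world = 8
    print("naive  :", max_fan_in(naive_order, world))
    print("rotated:", max_fan_in(rotated_order, world))
```

In the naive order all ranks hit the same destination at every step, so fan-in equals the number of ranks. With the cyclic shift each step is a permutation of destinations, so no destination receives from more than one rank at a time, while every rank still visits every destination exactly once.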

**Figure 7.** Forward/backward Dispatch-to-Combine latency breakdown under balanced routing. Bar annotations report total speedup over the standard operator-by-operator baseline.

Adjacent text from the paper: … execution path with full-device operators, full-core exclusive execution, and collective AllToAll communication. For end-to-end step latency, the baseline also retains MindSpore's DVM-level automatic fusion and graph-level execution planning,…

**Figure 8.** End-to-end latency for one training step with sampled natural routing. Bar annotations report total step-level speedup over the standard operator-by-operator baseline.

**Figure 9.** SwiGLU+Add cache microbenchmarks under serial and tile-interleaved execution. Left: execution latency. Right: L2 cache hit rate.

Adjacent text from the paper (Section 6, Microbenchmarks): Section 5 reports both Dispatch-to-Combine MoE-FFN module latency and end-to-end training-step latency after communication, computation, synchronization, and ordering optimizations are applied together. This section complements that evaluation with focused mi…
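One plausible reason tile-interleaved execution would raise the L2 hit rate in a SwiGLU+Add pair can be sketched with a toy LRU cache. Everything here is an assumption made for illustration: the cache is fully associative LRU, each tile occupies one cache slot, SwiGLU touches a tile once and Add then reads the same tile once. The real L2 behavior on Ascend hardware is not modeled.

```python
from collections import OrderedDict


def hit_rate(accesses, capacity):
    """Fraction of accesses that hit in an LRU cache holding `capacity` tiles."""
    cache, hits = OrderedDict(), 0
    for tile in accesses:
        if tile in cache:
            hits += 1
            cache.move_to_end(tile)        # mark as most recently used
        else:
            cache[tile] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used tile
    return hits / len(accesses)


def serial_trace(num_tiles):
    """SwiGLU processes every tile, then Add re-reads every tile."""
    return list(range(num_tiles)) + list(range(num_tiles))


def interleaved_trace(num_tiles):
    """Add consumes tile i right after SwiGLU produces tile i."""
    return [tile for i in range(num_tiles) for tile in (i, i)]


if __name__ == "__main__":
    print("serial     :", hit_rate(serial_trace(8), capacity=4))
    print("interleaved:", hit_rate(interleaved_trace(8), capacity=4))
```

When the working set (8 tiles) exceeds the cache (4 tiles), the serial trace evicts each tile before Add comes back for it and never hits, while the interleaved trace hits on every second access. The absolute numbers are artifacts of the toy model; only the direction of the effect is the point.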

Original abstract

Modern Mixture-of-Experts (MoE) models increasingly rely on large-scale AI accelerator clusters for efficient training. Ascend NPUs expose heterogeneous on-chip compute resources, including matrix-oriented AIC units and vector-oriented AIV units with explicit cross-queue synchronization support. However, existing training frameworks largely execute MoE operators in a serialized kernel-by-kernel manner, leaving substantial heterogeneous parallelism underutilized. This paper presents HyperParallel-MoE, a compilation and scheduling framework for MoE training on Ascend NPUs. HyperParallel-MoE transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow spanning AIC and AIV resources. It introduces AIV-driven one-sided communication to eliminate host-side collective synchronization, dependency-preserving tile task generation to unify communication and computation under a common task abstraction, and event-driven static scheduling to coordinate cross-queue execution with low runtime overhead. HyperParallel-MoE further executes the compiled taskflow within a unified runtime that concurrently drives AIC and AIV workers inside a single kernel launch, enabling fine-grained overlap among communication, matrix computation, and vector computation while preserving existing optimized operators. We implement HyperParallel-MoE in the MindSpore and MindFormers stack and evaluate it using DeepSeek-style MoE models on Ascend A3 clusters. Across multiple expert-parallel configurations, HyperParallel-MoE reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58x, demonstrating that tile-level heterogeneous scheduling can substantially improve MoE training efficiency on modern NPUs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The abstract sketches a tile-level scheduler for MoE on Ascend NPUs that overlaps AIC/AIV work and one-sided comms, but supplies no methods, baselines, or data to support the 1.58x claim.


The paper's main move is to take MoE dispatch-combine-FFN execution and break it into statically scheduled tiles that run across the matrix (AIC) and vector (AIV) queues on Ascend hardware. It adds AIV-driven one-sided communication to drop host collectives, a dependency-preserving tile generator, and event-driven static scheduling so everything fits inside one kernel launch. That combination is not in the cited prior work on Ascend, so the specific engineering for this platform is new. The approach also keeps existing optimized operators intact, which is practical.

The abstract states a 1.58x reduction in Dispatch-to-Combine latency across expert-parallel configs on A3 clusters, which would matter for anyone training large MoE models on these NPUs. The evidence for that number is missing: no baselines, no error bars, no description of how tiles were generated or how overlap was measured. The central assumption, that the cross-queue coordination and single-launch runtime add negligible overhead and preserve correctness, cannot be checked from what is here. If the full paper shows reproducible experiments with clear controls, the claim becomes testable; right now it is not.

This is for readers who care about NPU-specific scheduling or heterogeneous on-chip overlap in large-model training. It is narrow enough that most groups would not cite it unless they also target Ascend, but the problem it attacks is real. The work deserves a serious referee because the idea is concrete and the hardware target is current, even if the current write-up leaves the performance numbers unverified.

Referee Report

0 major / 1 minor

Summary. The paper introduces HyperParallel-MoE, a compilation and scheduling framework for Mixture-of-Experts (MoE) training on Ascend NPUs. It transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow that spans AIC matrix units and AIV vector units. Key techniques include AIV-driven one-sided communication to remove host-side synchronization, dependency-preserving tile task generation to unify communication and computation, and event-driven static scheduling for cross-queue coordination. The framework executes the taskflow in a unified runtime inside a single kernel launch to enable fine-grained overlap of communication, matrix, and vector computation while preserving existing optimized operators. It is implemented in the MindSpore/MindFormers stack and evaluated on DeepSeek-style MoE models on Ascend A3 clusters, reporting up to 1.58x reduction in Dispatch-to-Combine MoE-FFN latency across multiple expert-parallel configurations.

Significance. If the empirical results hold, the work is significant for demonstrating how to exploit heterogeneous on-chip resources (AIC/AIV) and explicit synchronization primitives on Ascend NPUs for MoE training, an increasingly important workload. The approach of compiling to a tile-level taskflow with static scheduling and single-kernel execution offers a concrete method to improve efficiency without altering existing high-performance operators. This is relevant to the distributed systems and high-performance computing community working on accelerator-specific optimizations for large models.

minor comments (1)

The abstract states a 1.58x latency reduction, but the provided text does not include the experimental section. The manuscript should ensure that the evaluation section supplies full baselines, configurations, error bars, and data-exclusion criteria so that the central claim can be verified.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript and for recognizing the potential significance of tile-level heterogeneous scheduling for MoE training on Ascend NPUs. The recommendation is listed as uncertain, but the report contains no specific major comments to address.

Circularity Check

0 steps flagged

No significant circularity


The paper describes an engineering framework for tile-level heterogeneous scheduling of MoE operators on Ascend NPUs and reports measured latency reductions (up to 1.58x) from experiments on DeepSeek-style models. No mathematical derivation chain, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text or abstract. The central claim is an empirical outcome of the implemented scheduling techniques rather than a result that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes an engineering framework that relies on existing hardware features of Ascend NPUs and prior optimized operators; no free parameters, new axioms, or invented entities are introduced or fitted.

pith-pipeline@v0.9.0 · 5838 in / 1062 out tokens · 17580 ms · 2026-05-25T02:47:29.580812+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 15 canonical work pages · 9 internal anchors

[1] Osayamen Aimuyo, Byungsoo Oh, and Rachee Singh. 2025. FlashMoE: Fast Distributed MoE in a Single Kernel. In Advances in Neural Information Processing Systems, D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen (Eds.), Vol. 38. Curran Associates, Inc., Red Hook, NY, USA, 100676–100699. https://proceedings.neurips.cc/paper_file...

[2] Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, and Xin Liu. 2024. FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion. arXiv:2406.06858 [cs.LG] https://arxiv.org/abs/2406.06858

[3] Yu Cheng, Lei Wang, Yining Shi, Yuqing Xia, Lingxiao Ma, Jilong Xue, Yang Wang, Zhiwen Mo, Feiyang Chen, Fan Yang, Mao Yang, and Zhi Yang. 2025. PipeThreader: Software-Defined Pipelining for Efficient DNN Execution. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). USENIX Association, Boston, MA, 767–783. https://www.usen...

[4] Damai Dai, Chengqi Deng, Chenggang Zhao, R.X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, et al. 2024. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv:2401.06066 [cs.CL] https://arxiv.org/abs/2401.06066

[5] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems, Vol. 35. Curran Associates, Inc., Red Hook, NY, USA, 16344–16359. https://arxiv.org/abs/2205.14135

[6] DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434 [cs.CL] https://arxiv.org/abs/2405.04434

[7] DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL] https://arxiv.org/abs/2412.19437

[8] DeepSeek-AI. 2025. DeepEP. https://github.com/deepseek-ai/DeepEP

[9] DeepSeek-AI. 2026. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. Technical report. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. Accessed May 19, 2026

[10] DeepSeek-AI. 2026. MegaMoE. https://github.com/deepseek-ai/DeepGEMM/pull/304. Merged Apr. 17, 2026

[11] William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39. http://jmlr.org/papers/v23/21-0998.html

[12] Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Jie Meng, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, and Yizhou Shan. 2025. DEEPSERVE: Serverless Large Language Model Serving at Scale. In 2025 USENIX Annual Technical Conference ...

[13] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...

[14] Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. 2024. Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping. In Proceedings of Machine Learning and Systems, Vol. 6. MLSys, Santa Clara, CA, USA, 13 pages. arXiv:2404.19429 [cs.DC] https://proceedings.mlsys.org/paper_files/paper/2024/file...

[15] Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J...

[16] MindSpore Contributors. 2020. MindSpore. https://www.mindspore.cn/

[17] MindSpore Contributors. 2024. MindSpore Transformers. https://www.mindspore.cn/mindformers/docs/en/master/mindformers.html

[18] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Granger, Phil Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. Association for Computing Machinery, Huntsville, ON, Canada, 15 pages. doi:10.1145/3...

[19] Qwen Team. 2024. Qwen2.5 Technical Report. arXiv:2412.15115 [cs.CL] https://arxiv.org/abs/2412.15115

[20] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, Toulo...

[21] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL] https://arxiv.org/abs/1909.08053

[22] Haiquan Wang, Chaoyi Ruan, Jia He, Jiaqi Ruan, Chengjie Tang, Xiaosong Ma, and Cheng Li. 2025. DHeLlam: General-Purpose, Automatic Micro-Batch Co-Execution for Distributed LLM Training. In 2025 IEEE 43rd International Conference on Computer Design (ICCD). 70–78. doi:10.1109/ICCD65941.2025.00017

[23] Jinwu Yang, Jiaan Wu, Zedong Liu, Xinyang Ma, Hairui Zhao, Yida Gu, Yuanhong Huang, Xingchen Liu, Wenjing Huang, Zheng Wei, Jing Xing, Yili Ma, Qingyi Zhang, Baoyi An, Zhongzhe Hu, Shaoteng Liu, Xia Zhu, Jiaxun Lu, Guangming Tan, and Dingwen Tao. 2026. ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs. arXiv:2604.03298 [c...

[24] Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, and Xin Liu. 2025. COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts. In Proceedings of Machine Learning and Systems, M. Zaharia, G. Joshi, and Y. Lin (Eds.), Vol. 7. MLSys, Santa C...

[25] Chenggang Zhao, Zhean Xu, Liang Zhao, Jiashi Li, Chenhao Xu, Anyi Xu, Shengyu Liu, Kexing Zhou, and Kuai Yu. 2025. DeepGEMM: clean and efficient BLAS kernel library on GPU. https://github.com/deepseek-ai/DeepGEMM

[26] Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, Yifan Guo, Ningxin Zheng, Ziheng Jiang, Xinyi Di, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, and Xin Liu. 2025. Triton-distributed: Programming Overlapping Kernels on Distributed AI Syst...

[27] Size Zheng, Xuegui Zheng, Li-wen Chang, and Jidong Zhai. 2026. UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training. arXiv:2604.19241 [cs.DC] https://arxiv.org/abs/2604.19241

[28] Yuhang Zhou, Zhibin Wang, Guyue Liu, Shipeng Li, Xi Lin, Zibo Wang, Yongzhong Wang, Fuchun Wei, Jingyi Zhang, Zhiheng Hu, Yanlin Liu, Chunsheng Li, Ziyang Zhang, Yaoyuan Wang, Bin Zhou, Wanchun Dou, Guihai Chen, and Chen Tian. 2025. Squeezing Operator Performance Potential for the Ascend Architecture. In Proceedings of the 30th ACM International Conference...

[29] Yuhang Zhou, Zibo Wang, Zhibin Wang, Ruyi Zhang, Chen Tian, Xiaoliang Wang, Wanchun Dou, Guihai Chen, Bingqiang Wang, Yonghong Tian, Yan Zhang, Hui Wang, Fuchun Wei, Boquan Sun, Jingyi Zhang, Bin She, Teng Su, Yifan Yao, Chunsheng Li, Ziyang Zhang, Yaoyuan Wang, Bin Zhou, and Guyue Liu. 2025. Accelerating Model Training on Ascend Chips: An Industrial Syst...
