HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs

Cheng Li; Congkun Ai; Da Lei; Guangpeng Zhang; Hanbo Zhang; Haoran Wang; Shihan Xiao; Teng Su; Xuefeng Jin; Zewen Jin


Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.


T0 review · grok-4.3

HyperParallel-MoE turns MoE operator execution into a static tile-level taskflow to overlap communication with matrix and vector compute on Ascend NPUs.

2026-06-30 15:07 UTC pith:IMOO45BQ

Load-bearing objection: Hardware-specific MoE scheduler for Ascend that turns AIC/AIV queues into a single-kernel tile taskflow and reports up to 1.58x Dispatch-to-Combine speedup with code released.

arxiv 2605.23764 v2 pith:IMOO45BQ submitted 2026-05-22 cs.DC

Zewen Jin, Congkun Ai, Guangpeng Zhang, Hanbo Zhang, Haoran Wang, Shihan Xiao, Da Lei, Xuefeng Jin, Teng Su, Cheng Li

classification cs.DC

keywords MoE training, Ascend NPUs, heterogeneous scheduling, tile-level taskflow, expert parallelism, communication overlap, AIC AIV coordination

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a scheduling method for Mixture-of-Experts training that makes better use of the separate matrix-oriented and vector-oriented units inside Ascend NPUs. Standard frameworks launch operators one at a time and leave the hardware's parallel capacity idle. The new approach builds a fixed schedule of small tasks that run communication, matrix work, and vector work together inside a single kernel launch. It keeps existing optimized operators unchanged and reports lower latency in the Dispatch-to-Combine stage of MoE-FFN blocks. A reader would care because faster execution on the same cluster size would let larger MoE models train in less wall-clock time.

Core claim

HyperParallel-MoE transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow spanning AIC and AIV resources. It introduces AIV-driven one-sided communication to eliminate host-side collective synchronization, dependency-preserving tile task generation to unify communication and computation under a common task abstraction, and event-driven static scheduling to coordinate cross-queue execution with low runtime overhead. The framework executes the compiled taskflow within a unified runtime that concurrently drives AIC and AIV workers inside a single kernel launch, enabling fine-grained overlap among communication, matrix computation, and vector computation while preserving existing optimized operators.

What carries the argument

The statically scheduled tile-level heterogeneous taskflow that unifies communication and computation under one abstraction and coordinates AIC and AIV queues via event-driven static scheduling.
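To make the mechanism concrete, here is a minimal Python sketch of a statically ordered taskflow across two queues. It is a toy simulation, not the paper's runtime: the `Task` record, the queue labels `"AIC"` and `"AIV"`, the durations, and the `simulate` and `build` helpers are all hypothetical. Each queue replays its tasks in a fixed, precomputed order, and a task starts only once its queue is idle and every task it depends on has finished, which stands in for a cross-queue event.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    queue: str        # "AIC" (matrix) or "AIV" (vector/communication); illustrative labels
    duration: float   # made-up cost in arbitrary time units
    deps: list = field(default_factory=list)  # names of tasks that must finish first


def simulate(tasks):
    """Replay a fixed task order and return (makespan, finish times).

    Tasks must be listed in a valid topological order. Each queue runs its
    tasks in list order, so nothing is reordered at run time.
    """
    queue_free = {}  # queue -> time at which it becomes idle
    finish = {}      # task name -> finish time
    for t in tasks:
        ready = max((finish[d] for d in t.deps), default=0.0)
        start = max(ready, queue_free.get(t.queue, 0.0))
        finish[t.name] = start + t.duration
        queue_free[t.queue] = finish[t.name]
    return max(finish.values()), finish


def build(num_tiles, dispatch, gmm):
    """Dispatch tile i on the AIV queue feeds GMM tile i on the AIC queue."""
    tasks = []
    for i in range(num_tiles):
        tasks.append(Task(f"dispatch{i}", "AIV", dispatch))
        tasks.append(Task(f"gmm{i}", "AIC", gmm, deps=[f"dispatch{i}"]))
    return tasks


# Kernel-by-kernel: one whole dispatch, then one whole GMM.
serial, _ = simulate(build(1, dispatch=4.0, gmm=8.0))
# Tile-level: the same total work split into 4 tiles so the queues overlap.
tiled, _ = simulate(build(4, dispatch=1.0, gmm=2.0))
print(serial, tiled)
```

In this toy model the tiled schedule finishes earlier only because later dispatch tiles run while earlier GMM tiles are still computing; it says nothing about real Ascend timings.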

Load-bearing premise

A statically generated tile-level taskflow can be executed with low runtime overhead while preserving correctness and compatibility with existing optimized operators.

What would settle it

Measure Dispatch-to-Combine MoE-FFN latency on Ascend A3 clusters with and without the HyperParallel-MoE scheduler; if the measured reduction disappears or the added coordination overhead exceeds the overlap gains, the central claim does not hold.
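The falsification test above reduces to a small comparison once latencies are measured. The sketch below is one hedged way to write it down in Python. The function names, the noise threshold, and every number are hypothetical, and it assumes that "up to 1.58x" denotes the ratio of baseline latency to scheduled latency, which the page does not spell out.

```python
def speedup(baseline_ms: float, scheduled_ms: float) -> float:
    """Latency ratio baseline / scheduled; above 1.0 means the scheduler is faster."""
    if baseline_ms <= 0 or scheduled_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / scheduled_ms


def claim_survives(baseline_ms: float, scheduled_ms: float, noise_ms: float = 0.0) -> bool:
    """True when the measured reduction exceeds the measurement noise."""
    return (baseline_ms - scheduled_ms) > noise_ms


def net_gain_ms(overlap_gain_ms: float, coordination_overhead_ms: float) -> float:
    """Positive only if overlap saves more time than coordination costs."""
    return overlap_gain_ms - coordination_overhead_ms


# Hypothetical measurements, chosen only to illustrate the check.
print(speedup(15.8, 10.0))
print(claim_survives(15.8, 10.0, noise_ms=0.5))
print(claim_survives(10.2, 10.0, noise_ms=0.5))
```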


If this is right

Communication, matrix computation, and vector computation overlap at fine granularity inside one kernel launch.
Existing optimized operators remain unchanged and are still used inside the new schedule.
The latency reduction applies across multiple expert-parallel configurations on Ascend A3 clusters.
The entire MoE-FFN stage runs under a single unified runtime driver rather than repeated host-kernel launches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same static tile scheduling pattern could be tested on other accelerators that expose separate matrix and vector engines with cross-queue synchronization.
If task generation overhead stays low at larger scales, the approach would support training bigger MoE models on fixed-size clusters without extra hardware.
Dynamic re-generation of the tile schedule at runtime could be compared against the static version to check whether adaptability improves results under changing network loads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Hardware-specific MoE scheduler for Ascend that turns AIC/AIV queues into a single-kernel tile taskflow and reports up to 1.58x Dispatch-to-Combine speedup with code released.


The main thing here is a practical engineering win on Ascend A3: they replace the usual serialized kernel launches for MoE dispatch/combine/FFN with a statically generated tile-level task graph that runs AIC matrix work and AIV vector/comms work inside one kernel launch. The new pieces are AIV-driven one-sided communication to drop host collectives, dependency-preserving tile generation that keeps the existing optimized operators intact, and event-driven static scheduling across the two queues. They ship the code in the MindSpore stack and show the 1.58x number on DeepSeek-style models across a few expert-parallel configs.

What stands out is that they actually released the implementation and targeted a real production stack rather than a toy prototype. The claim is narrow but concrete: measurable overlap of comm, matmul, and vector ops without rewriting the operators themselves.

The soft spot is the evaluation. The abstract gives the headline number but the description does not spell out baseline details, variance across runs, or exactly how much of the gain comes from the new scheduler versus other tuning. If the full paper has those controls and the numbers hold under the same conditions, the result is usable. If not, the 1.58x is harder to trust for anyone trying to reproduce on the same hardware.

This is for people who already run MoE training on Ascend clusters and need lower latency on the expert-parallel path. It is not a general algorithmic advance. A serious referee should see it because the system is reproducible, the hardware target is current, and the engineering choices are described at the level that matters for implementers. I would send it out rather than desk-reject.

Referee Report

1 major / 1 minor

Summary. The paper presents HyperParallel-MoE, a compilation and scheduling framework for MoE training on Ascend NPUs. It transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow spanning AIC and AIV resources via AIV-driven one-sided communication, dependency-preserving tile task generation, and event-driven static scheduling. The approach executes the taskflow in a unified runtime for fine-grained overlap of communication, matrix computation, and vector computation while preserving existing operators. Implemented in MindSpore/MindFormers and evaluated on DeepSeek-style MoE models on Ascend A3 clusters, it claims up to 1.58x reduction in Dispatch-to-Combine MoE-FFN latency across expert-parallel configurations, with source code released.
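As a rough illustration of what dependency-preserving tile task generation could look like, the Python sketch below splits an operator chain into tile tasks and checks that no operator-level edge is lost. It is an assumption-laden toy: the operator chain, the queue mapping, and the rule that tile i of one operator depends only on tile i of its predecessor are simplifications introduced here, not details taken from the paper.

```python
# Illustrative operator chain and queue mapping (assumed for this sketch).
OPS = [
    ("Dispatch", "AIV"),
    ("GMM_gate_up", "AIC"),
    ("SwiGLU", "AIV"),
    ("GMM_down", "AIC"),
    ("Combine", "AIV"),
]


def generate_tile_tasks(ops, num_tiles):
    """Emit one task per (operator, tile); tile i depends on tile i of the previous operator."""
    tasks = {}
    for k, (op, queue) in enumerate(ops):
        for i in range(num_tiles):
            deps = [(ops[k - 1][0], i)] if k > 0 else []
            tasks[(op, i)] = {"queue": queue, "deps": deps}
    return tasks


def preserves_operator_order(tasks, ops):
    """Check that every tile keeps exactly the edge to its predecessor operator."""
    names = [op for op, _ in ops]
    for (op, i), task in tasks.items():
        k = names.index(op)
        expected = [(names[k - 1], i)] if k > 0 else []
        if task["deps"] != expected:
            return False
    return True


tasks = generate_tile_tasks(OPS, num_tiles=3)
print(len(tasks), preserves_operator_order(tasks, OPS))
```

Because communication (Dispatch, Combine) and computation (GMM, SwiGLU) share one task record here, a single scheduler can order both kinds of work, which is the point of a common task abstraction.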

Significance. If the speedup claims are robustly supported, the work is significant for showing how static tile-level scheduling can exploit heterogeneous on-chip resources (AIC/AIV) on Ascend NPUs to improve MoE training efficiency beyond serialized kernel execution. The engineering focus on low-overhead static taskflows with operator compatibility is relevant for large-scale AI clusters. The public release of source code is a clear strength, supporting reproducibility in the distributed computing and systems community.

major comments (1)

§5 (Evaluation): The reported 1.58x Dispatch-to-Combine MoE-FFN latency reduction lacks details on baselines (e.g., standard MindSpore MoE execution), exact expert-parallel configurations tested, number of runs, error bars, or measurement methodology. This is load-bearing for the central claim, as it prevents verification that the statically generated tile-level taskflow delivers the speedup with negligible runtime overhead and preserved correctness.

minor comments (1)

The abstract paragraph is lengthy and could be tightened for clarity without losing technical content.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The evaluation details are indeed critical to supporting the central performance claim, and we will strengthen this section accordingly.


Referee: §5 (Evaluation): The reported 1.58x Dispatch-to-Combine MoE-FFN latency reduction lacks details on baselines (e.g., standard MindSpore MoE execution), exact expert-parallel configurations tested, number of runs, error bars, or measurement methodology. This is load-bearing for the central claim, as it prevents verification that the statically generated tile-level taskflow delivers the speedup with negligible runtime overhead and preserved correctness.

Authors: We agree that additional methodological details are necessary for readers to fully verify the reported speedup and the low-overhead nature of the static scheduling. In the revised manuscript we will expand §5 with: (i) an explicit statement that the baseline is unmodified MindSpore/MindFormers MoE execution using the same operators and collective primitives; (ii) the precise expert-parallel configurations (number of experts, EP degree, and model sizes) used for the 1.58× result; (iii) the number of repeated runs and any reported variance or error bars; and (iv) the exact measurement methodology, including how Dispatch-to-Combine latency was isolated, how the single-kernel-launch taskflow was timed, and how functional equivalence to the baseline was confirmed. These additions will be placed in the main evaluation section and will not alter any performance numbers. revision: yes
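Items (iii) and (iv) amount to reporting a speedup together with its spread over repeated runs. A minimal Python sketch of such a report follows. The helper names and the sample latencies are hypothetical, and the uncertainty uses a first-order (delta-method) approximation for a ratio of independent means, which is one common choice rather than anything the authors commit to.

```python
import statistics


def summarize(samples_ms):
    """Return run count, mean, and sample standard deviation of latency samples."""
    n = len(samples_ms)
    mean = statistics.mean(samples_ms)
    stdev = statistics.stdev(samples_ms) if n > 1 else 0.0
    return {"n": n, "mean_ms": mean, "stdev_ms": stdev}


def speedup_with_spread(baseline_ms, scheduled_ms):
    """Return (speedup, approximate standard error of the speedup)."""
    b, s = summarize(baseline_ms), summarize(scheduled_ms)
    ratio = b["mean_ms"] / s["mean_ms"]
    # Squared relative standard error of each mean: (stdev / mean)^2 / n.
    rel_b = (b["stdev_ms"] / b["mean_ms"]) ** 2 / b["n"]
    rel_s = (s["stdev_ms"] / s["mean_ms"]) ** 2 / s["n"]
    return ratio, ratio * (rel_b + rel_s) ** 0.5


# Made-up latencies in milliseconds, for illustration only.
baseline_runs = [15.9, 15.7, 15.8]
scheduled_runs = [10.1, 9.9, 10.0]
ratio, err = speedup_with_spread(baseline_runs, scheduled_runs)
print(f"speedup = {ratio:.3f} +/- {err:.3f}")
```

Reporting the pair rather than a single headline number lets a reader judge whether a configuration's speedup is distinguishable from run-to-run noise.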

Circularity Check

0 steps flagged

No significant circularity


The paper is a systems/engineering contribution describing a compilation and scheduling framework for MoE on Ascend NPUs. It reports empirical latency reductions from hardware evaluation rather than any mathematical derivation, equations, fitted parameters, or predictions. No load-bearing steps reduce to self-definition, self-citation chains, or renamed inputs; the central claim rests on measured speedups with released code. This is the expected non-finding for an implementation paper without a derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the work is a systems engineering contribution rather than a theoretical derivation.

pith-pipeline@v0.9.1-grok · 5864 in / 1127 out tokens · 33065 ms · 2026-06-30T15:07:39.833719+00:00 · methodology


Original abstract

Modern Mixture-of-Experts (MoE) models increasingly rely on large-scale AI accelerator clusters for efficient training. Ascend NPUs expose heterogeneous on-chip compute resources, including matrix-oriented AIC units and vector-oriented AIV units with explicit cross-queue synchronization support. However, existing training frameworks largely execute MoE operators in a serialized kernel-by-kernel manner, leaving substantial heterogeneous parallelism underutilized. This paper presents HyperParallel-MoE, a compilation and scheduling framework for MoE training on Ascend NPUs. HyperParallel-MoE transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow spanning AIC and AIV resources. It introduces AIV-driven one-sided communication to eliminate host-side collective synchronization, dependency-preserving tile task generation to unify communication and computation under a common task abstraction, and event-driven static scheduling to coordinate cross-queue execution with low runtime overhead. HyperParallel-MoE further executes the compiled taskflow within a unified runtime that concurrently drives AIC and AIV workers inside a single kernel launch, enabling fine-grained overlap among communication, matrix computation, and vector computation while preserving existing optimized operators. We implement HyperParallel-MoE in the MindSpore and MindFormers stack and evaluate it using DeepSeek-style MoE models on Ascend A3 clusters. Across multiple expert-parallel configurations, HyperParallel-MoE reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58x, demonstrating that tile-level heterogeneous scheduling can substantially improve MoE training efficiency on modern NPUs. The source code is available at https://gitcode.com/mindspore/hyper-parallel/tree/master/hyper_parallel/core/multicore

Figures

Figures reproduced from arXiv: 2605.23764 by Cheng Li, Congkun Ai, Da Lei, Guangpeng Zhang, Hanbo Zhang, Haoran Wang, Shihan Xiao, Teng Su, Xuefeng Jin, Zewen Jin.

**Figure 1.** Ascend NPU heterogeneous AIC/AIV execution model. Adjacent paper text: "… are resolved offline. We integrate HyperParallel-MoE into the MindSpore and MindFormers training stack [16, 17] with low code intrusion, while preserving existing optimized implementations of GMM, SwiGLU, and communication operators. We evaluate HyperParallel-MoE using DeepSeek-V3-style MoE models [7] on clusters of Ascend A3 NPUs. Across EP4, EP8, and EP…"

**Figure 2.** Forward and backward MoE-FFN operator graph with AIC/AIV mapping. Adjacent paper text: "… to form the final MoE output. Representative MoE models include DeepSeek-V2 [6], DeepSeek-V3 [7], Mixtral 8×7B [13], and Qwen2.5-MoE [19]. To better support Mixture-of-Experts (MoE) training on A3 NPUs, we examine its computational structure in depth. Consider the MoE feed-forward network (MoE-FFN) as a representative example. Its forward …"

**Figure 3.** End-to-end training step time breakdown on Ascend A3. Extracted panel labels: (a) Kernel-by-Kernel Execution; (b) Tile-Level AIC/AIV Pipeline.

**Figure 4.** Kernel-by-kernel execution versus tile-level AIC/AIV pipelining. Adjacent paper text: "After SwiGLUgrad, GMMgate_grad and GMMw1_grad become independent consumers; backward Combine then returns the resulting input activation gradient [7, 16]. These operators stress different hardware resources. GMM operators mainly use Cube matrix engines, whereas Dispatch, Combine, SwiGLU, activation gradients, and data movement map mostly to …"

**Figure 5.** Overview of HyperParallel-MoE. Adjacent paper text: "… decomposes them into fine-grained tile tasks and organizes these tasks into concurrent execution streams across heterogeneous hardware queues. At a high level, HyperParallel-MoE shifts MoE execution from a kernel-centric model to a taskflow-centric model. During compilation, the framework analyzes operator dependencies, legal tiling strategies, tensor layouts, and hardware…"

**Figure 6.** Rank-Aware Task Reordering (RATR). The naive order creates destination-rank hotspots, while RATR rotates each rank's task order to form a balanced communication pattern. Adjacent paper text: "… both the activation-gradient GMM and the down-projection weight-gradient GMM consume the dispatched expert activations without depending on each other. If the scheduler executes one GMM branch in its entirety before launching the other, …"

**Figure 8.** End-to-end latency for one training step with sampled natural routing. Bar annotations report total step-level speedup over the standard operator-by-operator baseline.

**Figure 7.** Forward/backward Dispatch-to-Combine latency breakdown under balanced routing. Bar annotations report total speedup over the standard operator-by-operator baseline. Adjacent paper text: "… execution path with full-device operators, full-core exclusive execution, and collective AllToAll communication. For end-to-end step latency, the baseline also retains MindSpore's DVM-level automatic fusion and graph-level execution planning, …"

**Figure 9.** SwiGLU+Add cache microbenchmarks under serial and tile-interleaved execution. Left: execution latency. Right: L2 cache hit rate. Adjacent paper text (Section 6, Microbenchmarks): "Section 5 reports both Dispatch-to-Combine MoE-FFN module latency and end-to-end training-step latency after communication, computation, synchronization, and ordering optimizations are applied together. This section complements that evaluation with focused mi…"
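The Figure 6 caption describes rotating each rank's task order so that ranks do not all target the same destination at once. A minimal Python sketch of that idea follows. It is only one plausible reading: the rotation offset (each rank starts with itself and walks forward), the function names, and the fan-in metric are assumptions made here, not the paper's stated algorithm.

```python
from collections import Counter


def naive_order(rank, world):
    """Every rank visits destinations 0, 1, ..., world-1 in the same order."""
    return list(range(world))


def rotated_order(rank, world):
    """Each rank starts at a different destination and wraps around (assumed offset)."""
    return [(rank + step) % world for step in range(world)]


def max_fan_in(order_fn, world):
    """Largest number of ranks that target one destination in the same step."""
    worst = 0
    for step in range(world):
        hits = Counter(order_fn(rank, world)[step] for rank in range(world))
        worst = max(worst, max(hits.values()))
    return worst


world = 8
print(max_fan_in(naive_order, world), max_fan_in(rotated_order, world))
```

With the naive order every rank hits the same destination in each step, while the rotated order makes the destinations of each step a permutation of the ranks, so no destination receives more than one transfer at a time in this model.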


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 9 internal anchors

[1] Osayamen Aimuyo, Byungsoo Oh, and Rachee Singh. 2025. FlashMoE: Fast Distributed MoE in a Single Kernel. In Advances in Neural Information Processing Systems, D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen (Eds.), Vol. 38. Curran Associates, Inc., Red Hook, NY, USA, 100676–100699. https://proceedings.neurips.cc/paper_file...

[2] Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, and Xin Liu. 2024. FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion. arXiv:2406.06858 [cs.LG] https://arxiv.org/abs/2406.06858

[3] Yu Cheng, Lei Wang, Yining Shi, Yuqing Xia, Lingxiao Ma, Jilong Xue, Yang Wang, Zhiwen Mo, Feiyang Chen, Fan Yang, Mao Yang, and Zhi Yang. 2025. PipeThreader: Software-Defined Pipelining for Efficient DNN Execution. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). USENIX Association, Boston, MA, 767–783. https://www.usen...

[4] Damai Dai, Chengqi Deng, Chenggang Zhao, R.X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, et al. 2024. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv:2401.06066 [cs.CL] https://arxiv.org/abs/2401.06066

[5, 6] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems, Vol. 35. Curran Associates, Inc., Red Hook, NY, USA, 16344–16359. https://arxiv.org/abs/2205.14135

[7] DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434 [cs.CL] https://arxiv.org/abs/2405.04434

[8] DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL] https://arxiv.org/abs/2412.19437

[9] DeepSeek-AI. 2025. DeepEP. https://github.com/deepseek-ai/DeepEP

[10] DeepSeek-AI. 2026. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. Technical report. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. Accessed May 19, 2026.

[11] DeepSeek-AI. 2026. MegaMoE. https://github.com/deepseek-ai/DeepGEMM/pull/304. Merged Apr. 17, 2026.

[12] William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39. http://jmlr.org/papers/v23/21-0998.html

[13] Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Jie Meng, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, and Yizhou Shan. 2025. DEEPSERVE: Serverless Large Language Model Serving at Scale. In 2025 USENIX Annual Technical Conference ...

[14] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...

[15] Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. 2024. Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping. In Proceedings of Machine Learning and Systems, Vol. 6. MLSys, Santa Clara, CA, USA, 13 pages. arXiv:2404.19429 [cs.DC] https://proceedings.mlsys.org/paper_files/paper/2024/file...

[16] Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J...

[17] MindSpore Contributors. 2020. MindSpore. https://www.mindspore.cn/

[18] MindSpore Contributors. 2024. MindSpore Transformers. https://www.mindspore.cn/mindformers/docs/en/master/mindformers.html

[19] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Granger, Phil Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. Association for Computing Machinery, Huntsville, ON, Canada, 15 pages. doi:10.1145/3341301.3359646

[20] Qwen Team. 2024. Qwen2.5 Technical Report. arXiv:2412.15115 [cs.CL] https://arxiv.org/abs/2412.15115

[21] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, Toulo...

[22] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL] https://arxiv.org/abs/1909.08053

[23] Haiquan Wang, Chaoyi Ruan, Jia He, Jiaqi Ruan, Chengjie Tang, Xiaosong Ma, and Cheng Li. 2025. DHeLlam: General-Purpose, Automatic Micro-Batch Co-Execution for Distributed LLM Training. In 2025 IEEE 43rd International Conference on Computer Design (ICCD). 70–78. doi:10.1109/ICCD65941.2025.00017

[24] Jinwu Yang, Jiaan Wu, Zedong Liu, Xinyang Ma, Hairui Zhao, Yida Gu, Yuanhong Huang, Xingchen Liu, Wenjing Huang, Zheng Wei, Jing Xing, Yili Ma, Qingyi Zhang, Baoyi An, Zhongzhe Hu, Shaoteng Liu, Xia Zhu, Jiaxun Lu, Guangming Tan, and Dingwen Tao. 2026. ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs. arXiv:2604.03298 [c...

[25] Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, and Xin Liu. 2025. COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts. In Proceedings of Machine Learning and Systems, M. Zaharia, G. Joshi, and Y. Lin (Eds.), Vol. 7. MLSys, Santa C...

[26] Chenggang Zhao, Zhean Xu, Liang Zhao, Jiashi Li, Chenhao Xu, Anyi Xu, Shengyu Liu, Kexing Zhou, and Kuai Yu. 2025. DeepGEMM: clean and efficient BLAS kernel library on GPU. https://github.com/deepseek-ai/DeepGEMM

[27] Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, Yifan Guo, Ningxin Zheng, Ziheng Jiang, Xinyi Di, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, and Xin Liu. 2025. Triton-distributed: Programming Overlapping Kernels on Distributed AI Syst...

[28] Size Zheng, Xuegui Zheng, Li-wen Chang, and Jidong Zhai. 2026. UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training. arXiv:2604.19241 [cs.DC] https://arxiv.org/abs/2604.19241

[29] Yuhang Zhou, Zhibin Wang, Guyue Liu, Shipeng Li, Xi Lin, Zibo Wang, Yongzhong Wang, Fuchun Wei, Jingyi Zhang, Zhiheng Hu, Yanlin Liu, Chunsheng Li, Ziyang Zhang, Yaoyuan Wang, Bin Zhou, Wanchun Dou, Guihai Chen, and Chen Tian. 2025. Squeezing Operator Performance Potential for the Ascend Architecture. In Proceedings of the 30th ACM International Conference... doi:10.1145/3676641.3716243

[30] Yuhang Zhou, Zibo Wang, Zhibin Wang, Ruyi Zhang, Chen Tian, Xiaoliang Wang, Wanchun Dou, Guihai Chen, Bingqiang Wang, Yonghong Tian, Yan Zhang, Hui Wang, Fuchun Wei, Boquan Sun, Jingyi Zhang, Bin She, Teng Su, Yifan Yao, Chunsheng Li, Ziyang Zhang, Yaoyuan Wang, Bin Zhou, and Guyue Liu. 2025. Accelerating Model Training on Ascend Chips: An Industrial Syst...
