pith. sign in

arxiv: 2605.22428 · v1 · pith:45XZOJV4new · submitted 2026-05-21 · 💻 cs.DC

Exploiting Multicast for Accelerating Collective Communication

Pith reviewed 2026-05-22 04:04 UTC · model grok-4.3

classification 💻 cs.DC
keywords collective communicationmulticastmany-to-many transmissionlatency reductionAllGatherAlltoAllAI model trainingAscend NPU
0
0 comments X

The pith

MultiWrite adopts multicast principles to remove redundant packet copies in many-to-many collective communications, directly lowering operator latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that many-to-many operations such as AllGather and AlltoAll can be made faster by replacing unicast writes with a transmission method that avoids sending duplicate data across links. Current implementations congest network bottlenecks because each receiver gets its own copy of the same data. MultiWrite applies multicast ideas but solves the management overhead and compatibility problems that have blocked multicast in AI systems. The result is a semantic that cuts the number of packets transmitted and therefore shortens the time for each collective operator. Demonstrations on production Ascend NPUs show this approach yields measurable latency gains in real workloads.

Core claim

MultiWrite is a novel many-to-many transmission semantic that eliminates redundant packets by adopting multicast principles while addressing critical limitations of traditional multicast for AI workloads. These limitations include heavy management plane overhead and ecosystem compatibility issues. Implemented on Ascend NPUs, the approach produces collective operators whose latency is reduced by up to 33 percent in long-term stress tests on commercially deployed devices.

What carries the argument

MultiWrite, a many-to-many transmission semantic that transmits each data item once to multiple receivers using multicast principles instead of unicast duplication.

If this is right

  • AllGather and AlltoAll operators transmit fewer packets and finish faster when they use the MultiWrite semantic.
  • Network links carry only one copy of each data item instead of one copy per receiver.
  • Collective communication latency drops without requiring changes to the surrounding AI training software stack.
  • End-to-end training and inference times improve because the communication phase no longer dominates as heavily.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same redundancy-removal idea could be ported to other accelerator interconnects that support multicast-like primitives.
  • If integrated into standard collective libraries, the technique would lower communication costs for any framework running large-model workloads.
  • Further work might combine MultiWrite with topology-aware scheduling to reduce contention on shared network resources.

Load-bearing premise

The critical limitations of traditional multicast can be fixed for AI workloads without creating new performance or compatibility problems.

What would settle it

A head-to-head measurement on the same Ascend NPU cluster in which MultiWrite-based AllGather or AlltoAll operators show no latency reduction or introduce compatibility failures compared with standard unicast implementations.

Figures

Figures reproduced from arXiv: 2605.22428 by Chao Xu, Chihyung Wang, Guoxin Qian, Jingbin Zhou, Xu Zhang, Yufeng Yao, Yuyan Wu, Zihang Luo.

Figure 1
Figure 1. Figure 1: In typical CLOS topologies, multicast improves bandwidth for single senders but cannot enhance effective end-to-end bandwidth in collective communication, where bottlenecks exist on both uplink and downlink. 2.3.1 Bandwidth reduction does not equate to latency reduction. Consider the topology illustrated in [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Left: Full-Mesh topology with two TP domains, leaving cross-domain links unused. Middle: Unicast multi-path leveraging utilizes cross-domain links but introduces redundant transmissions. Right: Multicast scheme exploits cross-domain links without redundancy, achieving higher effective bandwidth. 3 Huawei Proprietary - Restricted Distribution 1 8 4 3 6 2 7 5 Server 1 1 8 4 3 6 2 7 5 Server 2 SW A A A B B B … view at source ↗
Figure 3
Figure 3. Figure 3: Multicast removes redundant data transfers on bandwidth-constrained inter-server links, thus improving effective communication efficiency. are linked through a CLOS network fabric. The cross-server bandwidth is intentionally oversubscribed relative to the intra-server bandwidth, so the available bandwidth from an NPU to another NPU inside the same server is markedly higher than that to an NPU in a differen… view at source ↗
Figure 4
Figure 4. Figure 4: The sender embeds metadata indicating destination node addresses into each outgoing packet. Upon receiving the packet, relay nodes parse the carried metadata and perform packet replication and forwarding according to preconfigured mapping rules. as a new transaction operation with a new TAOpcode. Packets associated with this new opcode are appended with meta￾data that indicates the complete set of multicas… view at source ↗
Figure 5
Figure 5. Figure 5: The position of MultiWrite modules in the system stack, where colored components denote the newly designed modules in this work. MultiWrite is built as an extension on top of the exist￾ing liburma.so library, as shown in 5. It consists of four core modular components that jointly realize the complete semantic. • The cs_ini module runs on every node that uses this semantic and performs essential initializat… view at source ↗
Figure 6
Figure 6. Figure 6: AllGather latency corresponding to three schemes. The results are collected from nearly 1000 iterations of on￾line stress tests for each AllGather communication operator. The AllGather operator built upon MultiWrite achieves the lowest and most stable latency [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: End-to-end latency of AlltoAll operators under different batch sizes. (a) Decode phase with typical sizes of 64 and 128; (b) Prefill phase with typical sizes of 1k and 2k. The MultiWrite-based AlltoAll operator achieves more significant performance gains under large batch sizes. AlltoAll.AlltoAll dispatch operators are primarily adopted in MoE workloads, where the separation of prefill phase and decode pha… view at source ↗
Figure 7
Figure 7. Figure 7: End-to-end latency of AllGather operators under different message sizes. AllGather. Empirically, per-rank message sizes for All￾Gather range from a few megabytes to several hundred megabytes. We therefore test message sizes from 256 KB to 200 MB [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: shows the result. MultiWrite brings a moderate increase in AICPU usage compared with the baseline. It is worth noting that the computing capability of AICPU is rel￾atively limited. In practical deployment, existing workloads generally bypass AICPU for high-performance data-plane 0 1 2 3 AICPU Usage (%) AG (Baseline) 0 5000 10000 15000 20000 25000 Timeline 0 5 10 15 AICPU Usage (%) AG with MultiWrite [PITH… view at source ↗
read the original abstract

Reducing collective communication latency is a critical goal for large model training and inference in both academia and industry. Many-to-many communications, such as AllGather and AlltoAll (dispatch), are core components of modern parallelization strategies. State-of-the-art implementations of these communications rely on unicast-based writes and transmit duplicate copies of the same data across physical links for multiple receivers. This redundant transmission congests network bottlenecks and degrades end-to-end latency. We present MultiWrite, a novel many-to-many transmission semantic that eliminates redundant packets to directly reduce operator latency. MultiWrite adopts multicast principles while addressing critical limitations of traditional multicast for AI workloads. These limitations include heavy management plane overhead and ecosystem compatibility issues. We implement MultiWrite on Ascend NPUs. Long-term stress tests demonstrate that our MultiWrite-based operators achieve up to 33% latency reduction on commercially deployed devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes MultiWrite, a novel many-to-many transmission semantic that leverages multicast principles to eliminate redundant packet transmissions in collective operations such as AllGather and AlltoAll for AI model training. It claims to address the management overhead and compatibility issues of traditional multicast, with an implementation on Ascend NPUs demonstrating up to 33% latency reduction in long-term stress tests on deployed devices.

Significance. If the empirical claims hold after verification, this could meaningfully advance communication efficiency in large-scale distributed AI training by reducing redundant transmissions without major ecosystem overhauls. The emphasis on practical deployment on commercial hardware and long-term testing is a positive aspect that strengthens applicability.

major comments (2)
  1. [Abstract] Abstract: The reported up to 33% latency reduction from stress tests on deployed devices lacks any description of experimental setup, baselines, error bars, number of trials, or measurement methodology. This detail is load-bearing for the central empirical claim and prevents assessment of result robustness.
  2. [Implementation] Implementation section: The assertion that MultiWrite resolves traditional multicast's heavy management plane overhead and ecosystem compatibility issues for AI workloads is not supported by concrete mechanisms (e.g., lightweight group management for dynamic jobs or framework integration details) or quantified overhead measurements showing no new penalties are introduced.
minor comments (1)
  1. [Abstract] Clarify the exact duration and workload conditions for the 'long-term stress tests' to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the detailed and constructive review of our manuscript. Below we respond to each major comment and indicate the revisions made to address them.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The reported up to 33% latency reduction from stress tests on deployed devices lacks any description of experimental setup, baselines, error bars, number of trials, or measurement methodology. This detail is load-bearing for the central empirical claim and prevents assessment of result robustness.

    Authors: We concur that the abstract does not include sufficient information on the experimental setup, baselines, error bars, number of trials, or measurement methodology for the reported latency reductions. This is a valid observation. In the revised manuscript, we have updated the abstract to incorporate a high-level description of these elements and have significantly expanded the 'Evaluation' section to provide full details on the experimental methodology, including baselines (unicast-based collective implementations), error bars from multiple runs, number of trials, and the specifics of the long-term stress tests on deployed Ascend NPUs. These changes allow for a thorough assessment of the result robustness. revision: yes

  2. Referee: [Implementation] Implementation section: The assertion that MultiWrite resolves traditional multicast's heavy management plane overhead and ecosystem compatibility issues for AI workloads is not supported by concrete mechanisms (e.g., lightweight group management for dynamic jobs or framework integration details) or quantified overhead measurements showing no new penalties are introduced.

    Authors: We acknowledge the referee's point that the implementation section lacks concrete mechanisms and quantified measurements supporting the resolution of management plane overhead and ecosystem compatibility issues. To address this, we have revised the implementation section to detail the lightweight group management mechanisms designed for dynamic AI workloads, including framework integration specifics, and have added overhead quantification results demonstrating that MultiWrite introduces no additional penalties compared to traditional approaches. This provides the necessary support for our assertions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on implementation and empirical benchmarks

full rationale

The paper describes an engineering system (MultiWrite) for many-to-many collective communication that adopts multicast principles while addressing management overhead and compatibility. Its central claims are validated through implementation on Ascend NPUs and long-term stress tests reporting up to 33% latency reduction. No mathematical derivation chain, equations, fitted parameters, or predictions appear in the abstract or description. The contribution is self-contained via concrete mechanisms and external benchmark results on deployed hardware, with no reduction of outputs to inputs by construction or self-citation load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The work is primarily an implementation and systems paper; it introduces the MultiWrite semantic as a new entity but does not rely on fitted parameters or unstated mathematical axioms beyond standard networking assumptions.

invented entities (1)
  • MultiWrite no independent evidence
    purpose: A many-to-many transmission semantic that eliminates redundant packets using multicast principles adapted for AI workloads
    Presented as the core novel contribution that directly reduces operator latency

pith-pipeline@v0.9.0 · 5695 in / 1127 out tokens · 38348 ms · 2026-05-22T04:04:43.087732+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 6 internal anchors

  1. [1]

    Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi. 2021. Synthesizing opti- mal collective algorithms. InProceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 62–75. Exploiting Multicast for Accelerating Collective Communication

  2. [2]

    Hiascend CANN. 2025. HcclAllGather.https://www.hiascend.com/ document/detail/en/canncommercial/800/apiref/hcclapiref/hcclcpp_ 07_0023.html. Accessed: 2026-05-05

  3. [3]

    Hiascend CANN. 2025. HcclAlltoAllV.https://www.hiascend.com/ document/detail/en/canncommercial/800/apiref/hcclapiref/hcclcpp_ 07_0027.html. Accessed: 2026-05-05

  4. [4]

    Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, and Zhongxin Liu. 2026. Parallel scaling law for language models.Advances in Neural Information Processing Systems38 (2026), 118958–118998

  5. [5]

    DeepSeek-AI, Aixin Liu, et al. 2025. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL]https://arxiv.org/abs/2412.19437

  6. [6]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch trans- formers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research23, 120 (2022), 1–39

  7. [7]

    Yida Gu, Fakang Wang, Jianhao Fu, Zhenhang Sun, Qianyu Zhang, Hairui Zhao, Xingchen Liu, Yang Tian, Wenjing Huang, Zedong Liu, et al. 2026. CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training. InProceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. 425–438

  8. [8]

    Guoliang He, Youhe Jiang, Wencong Xiao, Jiang Kaihua, Shuguang Wang, Jun Wang, Du Zixian, Zhuo Jiang, Xinlei Zhang, Binhang Yuan, et al. 2026. Efficient pre-training of llms via topology-aware com- munication alignment on more than 9600 gpus.Advances in Neural Information Processing Systems38 (2026), 147100–147126

  9. [9]

    Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guid- ance.arXiv preprint arXiv:2207.12598(2022)

  10. [10]

    Torsten Hoefler, Christian Siebert, and Wolfgang Rehm. 2007. A practically constant-time MPI Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast. In2007 IEEE International Parallel and Distributed Processing Symposium. IEEE, 1–8

  11. [11]

    Chengyuan Huang, Yixiao Gao, Wei Chen, Duoxing Li, Yibo Xiao, Ruyi Zhang, Chen Tian, Xiaoliang Wang, Wanchun Dou, Guihai Chen, et al

  12. [12]

    In2023 IEEE 31st International Conference on Network Protocols (ICNP)

    Mc-rdma: Improving replication performance of rdma-based distributed systems with reliable multicast support. In2023 IEEE 31st International Conference on Network Protocols (ICNP). IEEE, 1–11

  13. [13]

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems32 (2019)

  14. [14]

    Huawei Technologies Co., Ltd. 2026. Huawei Collective Communi- cation Library (HCCL).https://www.hiascend.com/software/cann. Online; accessed 14-May-2026

  15. [15]

    2024.InfiniBand Architecture Specifica- tion, Volume 1: General Specifications(release 1.8 ed.)

    InfiniBand Trade Association. 2024.InfiniBand Architecture Specifica- tion, Volume 1: General Specifications(release 1.8 ed.). Technical Report. InfiniBand Trade Association.https://www.infinibandta.org/ibta- specification/

  16. [16]

    Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, and Torsten Hoefler. 2024. Network-offloaded bandwidth-optimal broadcast and Allgather for distributed AI. InSC24: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–17

  17. [17]

    Kyuho J Lee. 2021. Architecture of neural processing unit for deep neural networks. InAdvances in computers. Vol. 122. Elsevier, 217–245

  18. [18]

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668 (2020)

  19. [19]

    Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. 2020. Pytorch distributed: Experiences on accelerating data parallel training.arXiv preprint arXiv:2006.15704(2020)

  20. [20]

    Wenxue Li, Junyi Zhang, Yufei Liu, Gaoxiong Zeng, Zilong Wang, Chaoliang Zeng, Pengpeng Zhou, Qiaoling Wang, and Kai Chen. 2024. Cepheus: accelerating datacenter applications with high-performance roce-capable multicast. In2024 IEEE International Symposium on High- Performance Computer Architecture (HPCA). IEEE, 908–921

  21. [21]

    Jinkun Lin, Ziheng Jiang, Zuquan Song, Sida Zhao, Menghan Yu, Zhanghan Wang, Chenyuan Wang, Zuocheng Shi, Xiang Shi, Wei Jia, et al. 2025. Understanding stragglers in large model training using what-if analysis. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). 483–498

  22. [22]

    Jiuxing Liu, Amith R Mamidala, and Dhabaleswar K Panda. 2004. Fast and scalable MPI-level broadcast using InfiniBand’s hardware multi- cast support. In18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.IEEE, 10

  23. [23]

    Qian Liu and Robert D Russell. 2014. IBRMP: A Reliable Multicast Protocol for InfiniBand. In2014 IEEE 22nd Annual Symposium on High- Performance Interconnects. IEEE, 79–86

  24. [24]

    Amith R Mamidala, Hyun-Wook Jin, and Dhabaleswar K Panda. 2005. Efficient hardware multicast group management for multiple mpi communicators over infiniband. InEuropean Parallel Virtual Ma- chine/Message Passing Interface Users’ Group Meeting. Springer, 388– 398

  25. [25]

    Microsoft Research. 2022. MSCCL: Microsoft Collective Communica- tion Library.https://github.com/microsoft/msccl. Online; accessed 14-May-2026

  26. [26]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- ley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient large-scale language model training on gpu clusters using megatron- lm. InProceedings of the international conference for high performance computing, netwo...

  27. [27]

    NVIDIA Corporation. 2026. NCCL User Guide: Collective Opera- tions.https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/ usage/collectives.html. Online; accessed 14-May-2026

  28. [28]

    NVIDIA Corporation. 2026. NVIDIA Collective Communication Li- brary (NCCL).https://developer.nvidia.com/nccl. Online; accessed 14-May-2026

  29. [29]

    NVIDIA Corporation. 2026. NVIDIA Collective Communication Li- brary (NCCL) User Guide.https://docs.nvidia.com/deeplearning/nccl/ user-guide/docs/. Online; accessed 14-May-2026

  30. [30]

    NVIDIA Corporation. 2026. NVIDIA NVLink and NVLink Switch. https://www.nvidia.com/en-us/data-center/nvlink/. Accessed: 2026- 05-12

  31. [31]

    Open Compute Project. 2025. Introducing ESUN: Ad- vancing Ethernet for Scale-Up AI Infrastructure at OCP. https://www.opencompute.org/blog/introducing-esun-advancing- ethernet-for-scale-up-ai-infrastructure-at-ocp. Accessed: 2026-05- 12

  32. [32]

    2025.OCP Scale-Up Ethernet (SUE) Speci- fication

    Open Compute Project. 2025.OCP Scale-Up Ethernet (SUE) Speci- fication. Technical Report. Open Compute Project.https://www. opencompute.org/documents/ocp-sue-spec-final-pdf-1Accessed: 2026-05-12

  33. [33]

    openEuler Community. 2026. UMDK Repository.https://gitcode.com/ openeuler/umdk. Accessed: 2026-05-11

  34. [34]

    Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow.arXiv preprint arXiv:1802.05799(2018)

  35. [35]

    Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi, and Rachee Singh. 2023. {TACCL}: Guiding collective algorithm synthe- sis using communication sketches. In20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 593–612

  36. [36]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. Chao Xu, Xu Zhang, Zihang Luo, Yuyan Wu, Guoxin Qian, Yufeng Yao, and CHIHYUNG WANG, Jingbin Zhou arXiv preprint arXiv:1909.08053(2019)

  37. [37]

    2025.UALink 1.0 White Paper

    UALink Consortium. 2025.UALink 1.0 White Paper. Technical Report. UALink Consortium.https://ualinkconsortium.org/wp- content/uploads/2025/04/UALink-1.0-White_Paper_FINAL.pdfAc- cessed: 2026-05-12

  38. [38]

    2025.Ultra Ethernet Specification

    Ultra Ethernet Consortium. 2025.Ultra Ethernet Specification. Techni- cal Report. Ultra Ethernet Consortium.https://ultraethernet.org/wp- content/uploads/sites/20/2025/06/UE-Specification-6.11.25.pdfAc- cessed: 2026-05-12

  39. [39]

    vLLM Team. 2024. vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction.https://vllm.ai/blog/perf-update. Online; accessed 14-May-2026

  40. [40]

    Rosen, Andrew Dolganow, Tony Przy- gienda, and Sam Aldrin

    IJsbrand Wijnands, Eric C. Rosen, Andrew Dolganow, Tony Przy- gienda, and Sam Aldrin. 2017. Multicast Using Bit Index Explicit Replication (BIER). RFC 8279. doi:10.17487/RFC8279

  41. [41]

    Rosen, Andrew Dolganow, Jeff Tantsura, Sam Aldrin, and Israel Meilik

    IJsbrand Wijnands, Eric C. Rosen, Andrew Dolganow, Jeff Tantsura, Sam Aldrin, and Israel Meilik. 2018. Encapsulation for Bit Index Explicit Replication (BIER) in MPLS and Non-MPLS Networks. RFC

  42. [42]

    doi:10.17487/RFC8296

  43. [43]

    Bin Xu, Ayan Banerjee, and Sandeep Gupta. 2025. Hardware Acceler- ation for Neural Networks: A Comprehensive Survey.arXiv preprint arXiv:2512.23914(2025)

  44. [44]

    Xiaohu Xu, Mach Chen, Keyur Patel, IJsbrand Wijnands, Tony Przy- gienda, and Zhaohui (Jeffrey) Zhang. 2025. BGP Extensions for Bit Index Explicit Replication (BIER). RFC 9793. doi:10.17487/RFC9793

  45. [45]

    Weihao Yang, Hao Huang, Donglei Wu, Ningke Li, Yanqi Pan, Qiyang Zheng, Wen Xia, Shiyi Li, and Qiang Wang. 2025. HybridEP: Scaling Ex- pert Parallelism to Cross-Datacenter Scenario via Hybrid Expert/Data Transmission.arXiv preprint arXiv:2510.19470(2025)

  46. [46]

    Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter-and {Intra-Operator} parallelism for distributed deep learning. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578. CHIHYUNG WANG, Jingbin Zhou„