pith. machine review for the scientific record.

arxiv: 2605.05888 · v1 · submitted 2026-05-07 · 💻 cs.AR · cs.DC

Recognition: unknown

MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 04:27 UTC · model grok-4.3

classification 💻 cs.AR cs.DC
keywords Mixture-of-Experts · multi-GPU communication · hardware-software co-design · communication overlap · GPU hub architecture · inter-GPU interconnect · large language model scaling · address management

The pith

MoE-Hub decouples data transmission from address management to enable seamless communication overlap in multi-GPU MoE systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts models face scaling limits on multi-GPU systems because their dynamic token-to-expert routing clashes with the static, address-centric communication model built into GPUs. This clash requires a heavy software layer to resolve addresses before any transfer can begin, which prevents clean overlap between communication and computation. MoE-Hub introduces a destination-agnostic paradigm in which producers transmit data immediately after routing decisions, using only a logical destination identifier. Lightweight hardware placed in the GPU hub then takes over address allocation and flow orchestration transparently. A sympathetic reader would care because this change promises both higher performance and simpler software for the large language models that rely on MoE scaling.

Core claim

MoE-Hub resolves the root abstraction mismatch by hardware-accelerating the entire communication control plane, so that producers send data right after routing with only a logical destination while the GPU hub transparently performs address allocation and data-flow orchestration, producing seamless overlap and measured per-layer speedups of 1.40x-3.08x together with end-to-end speedups of 1.21x-1.98x.

What carries the argument

The destination-agnostic communication paradigm in which data transmission is separated from address management and offloaded to lightweight hardware inside the GPU hub.
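
To make the contrast concrete, here is a minimal Python sketch of the two control planes; every name in it (resolve_remote_address, rdma_write, hub_send) is a hypothetical stand-in for illustration, not the paper's actual API.

    # Sketch of the two communication control planes the paper contrasts.
    # All callables here are hypothetical stand-ins, not the paper's API.

    from dataclasses import dataclass

    @dataclass
    class Token:
        data: bytes
        expert_id: int  # logical destination chosen by the router

    # Baseline: software-mediated, address-centric communication.
    def baseline_dispatch(tokens, resolve_remote_address, rdma_write):
        """Software resolves a physical remote address per token before any
        transfer starts, putting metadata exchange on the critical path."""
        for tok in tokens:
            addr = resolve_remote_address(tok.expert_id)  # blocking phase
            rdma_write(addr, tok.data)                    # transfer starts late

    # MoE-Hub style: destination-agnostic communication.
    def destination_agnostic_dispatch(tokens, hub_send):
        """The producer sends immediately after routing, tagging the payload
        with only a logical expert id; hub hardware allocates the address."""
        for tok in tokens:
            hub_send(tok.expert_id, tok.data)

The point of the comparison is what disappears from the producer's critical path: the entire address-resolution phase moves out of software and behind the hub.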

If this is right

  • Per-layer execution times improve by 1.40x to 3.08x over current state-of-the-art systems (the back-of-envelope sketch after this list shows what communication share such speedups would imply).
  • Full end-to-end training or inference runs improve by 1.21x to 1.98x.
  • Communication and computation overlap becomes transparent without extra programmer effort.
  • Software for dynamic token routing becomes simpler because address resolution is removed from the critical path.
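
As a way to gauge these factors, a back-of-envelope overlap model is sketched below. It assumes, purely for illustration, that overlap is the only source of speedup and that all exposed communication is hidden; the paper's gains also include reduced software overhead, so the implied communication shares are upper bounds, not the paper's numbers.

    # Back-of-envelope overlap model (illustrative; not from the paper).
    # If a fraction c of baseline per-layer time is exposed communication
    # and overlap hides all of it, the ideal speedup is 1 / (1 - c).

    def ideal_overlap_speedup(comm_fraction: float) -> float:
        assert 0.0 <= comm_fraction < 1.0
        return 1.0 / (1.0 - comm_fraction)

    def implied_comm_fraction(speedup: float) -> float:
        """Invert the model: the exposed-communication share a reported
        speedup would correspond to if overlap were the only effect."""
        return 1.0 - 1.0 / speedup

    for s in (1.40, 3.08):  # the paper's reported per-layer speedup range
        print(f"{s:.2f}x speedup implies c <= {implied_comm_fraction(s):.2f}")
    # 1.40x -> c <= 0.29;  3.08x -> c <= 0.68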

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decoupling pattern could support other irregular, data-dependent communication patterns that appear in graph neural networks or sparse transformers.
  • GPU vendors might consider embedding similar lightweight hubs in future interconnect designs to handle dynamic workloads more efficiently.
  • MoE training frameworks could drop custom address-resolution passes and rely on the hardware abstraction instead.
  • Scaling experiments on clusters larger than those tested would reveal whether the hub remains lightweight when the number of experts and GPUs grows.

Load-bearing premise

Lightweight hardware added to the GPU hub can manage address allocation and data-flow orchestration at scale without adding performance overhead or requiring incompatible changes to existing GPU architectures.
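
A functional sketch, in Python, of what this premise asks of the hub: map a logical expert id to a physical slot as each packet arrives. The paper names a Runtime Packet Manager (Figure 10) and a static, conservatively pre-allocated expert-input region scheme (Figure 8's caption), but the bump-pointer policy and all identifiers below are illustrative assumptions, not the paper's design.

    # Hub-side address allocation, sketched as software (assumption-laden:
    # the real design is hardware; names and policy here are hypothetical).

    class HubAddressAllocator:
        """Maps a logical expert id to the next free slot in that expert's
        pre-allocated input region as each packet reaches the hub."""

        def __init__(self, region_base: dict[int, int], slot_bytes: int,
                     capacity: int):
            self.region_base = region_base  # expert_id -> region base address
            self.slot_bytes = slot_bytes    # bytes reserved per token
            self.capacity = capacity        # max tokens per expert region
            self.next_slot = {e: 0 for e in region_base}

        def allocate(self, expert_id: int) -> int:
            slot = self.next_slot[expert_id]
            if slot >= self.capacity:
                # Static pre-allocation must be conservative to avoid this.
                raise MemoryError(f"expert {expert_id} input region overflow")
            self.next_slot[expert_id] = slot + 1
            return self.region_base[expert_id] + slot * self.slot_bytes

    # Example: 4 experts, 4 KiB per token slot, 1024-token regions.
    alloc = HubAddressAllocator(
        region_base={e: e * 4096 * 1024 for e in range(4)},
        slot_bytes=4096,
        capacity=1024,
    )
    addr = alloc.allocate(expert_id=2)  # address chosen by the hub, not the producer

The load-bearing question is whether this bookkeeping stays off the critical path at hardware packet rates as expert and GPU counts grow.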

What would settle it

A prototype implementation of the GPU hub hardware that shows either increased end-to-end latency for MoE layers or no improvement in overlap relative to current software baselines would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2605.05888 by Chen Zhang, Guanghui He, Guangyu Sun, Haibo Wang, Jingwen Leng, Minyi Guo, Qijun Zhang, Shuyi Zhang, Yijia Diao, Zhe Zhou, Zhigang Ji, Zhipeng Tu, Zhuoshan Zhou.

Figure 1: Comparison of computation-communication overlap …
Figure 2: Exemplary execution of a 4-expert MoE layer dis…
Figure 3: MoE performance gap to ideal (Mixtral-8×7B on …)
Figure 4: From software-mediated address resolution (a, b) to hardware method (c), MoE-Hub eliminates heavy software overheads.
Figure 6: Proportion of communication latency on both producer …
Figure 7: MoE-Hub overview: hub-side extensions and datapath.
Figure 8: ISA and runtime API extensions. (The accompanying text notes a static memory model that pre-allocates conservative expert-input regions behind the rowspMalloc API; more advanced schemes, e.g. paging-style KV-cache management or emerging GPU-side proposals [8], [15], [30], [33], can be integrated to improve memory utilization.)
Figure 9: Illustration of MoE layer dataflow under the destination-agnostic communication paradigm.
Figure 10: Runtime Packet Manager microarchitecture.
Figure 11: Data Availability Manager microarchitecture and its …
Figure 12: End-to-end latency evaluation. Total token count is M = SeqLength × NGPU, where NGPU is the number of devices.
Figure 13: Layer duration and speedup with varying total token count on 8 GPUs. Other parameters follow Mixtral 8×7B.
Figure 14: Layer duration and speedup with a total token count …
Figure 15: Comparison of four MoE-Hub variants with Comet from routing to expert GEMM1, and speedups achieved by Runtime …
Figure 16: End-to-end latency evaluation on Qwen-2 and Phi-3.5 …
Figure 17: Normalized FLOPS for 1×, 2×, 4×, and 8× GPU …
Original abstract

The Mixture-of-Experts (MoE) architecture is crucial for scaling large language models, but its scalability is severely limited by inter-GPU communication bottlenecks in multi-GPU systems. Although overlapping communication with computation is a widely recognized optimization, its effective deployment still remains challenging, both in terms of performance and programmability. In this work, we identify the root cause as a fundamental abstraction mismatch between MoE's dynamic, irregular token-to-expert mapping and the static, address-centric communication model of modern GPUs, which necessitates a complex software mediation phase to resolve addresses before data transfers, limiting performance and software flexibility. To resolve this, we propose MoE-Hub, a hardware-software co-design that introduces a destination-agnostic communication paradigm. MoE-Hub decouples data transmission from address management, allowing producers to send data immediately after routing using only a logical destination, while address allocation and data-flow orchestration are handled transparently by lightweight hardware in the GPU hub. By hardware-accelerating the entire communication control plane, MoE-Hub enables seamless and transparent overlap. Our evaluation shows that MoE-Hub achieves 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedup over state-of-the-art systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper identifies the root cause of limited scalability in Mixture-of-Experts (MoE) architectures on multi-GPU systems as an abstraction mismatch between dynamic token-to-expert mappings and the static address-centric communication model of GPUs, which requires complex software mediation. It proposes MoE-Hub, a hardware-software co-design that introduces a destination-agnostic communication paradigm. This allows producers to send data using only logical destinations immediately after routing, with address allocation and data-flow orchestration handled transparently by lightweight hardware in the GPU hub. Hardware acceleration of the communication control plane enables seamless overlap, with reported speedups of 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end over state-of-the-art systems.

Significance. If the central claims hold, MoE-Hub could significantly advance the field by simplifying software for MoE overlap and providing hardware support for efficient communication in multi-GPU setups. This co-design approach addresses a key bottleneck in scaling large language models, potentially leading to improved performance and flexibility in distributed training and inference without major changes to existing GPU architectures.

major comments (2)
  1. The abstract presents concrete performance improvements (1.40x-3.08x per-layer and 1.21x-1.98x end-to-end) but omits any description of the experimental methodology, chosen baselines, hardware platform details, or error bars, which is critical for assessing the validity of the speedups claimed for the proposed design.
  2. The assertion that 'lightweight hardware in the GPU hub' can transparently resolve logical destinations to physical addresses and orchestrate data flows without software mediation or performance overheads lacks supporting details such as microarchitectural specifications, estimated area and latency costs, or compatibility analysis with current GPU memory models and interconnects like NVLink.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and positive assessment of the significance of MoE-Hub. We address each major comment below, providing clarifications from the manuscript and indicating planned revisions.

Point-by-point responses
  1. Referee: The abstract presents concrete performance improvements (1.40x-3.08x per-layer and 1.21x-1.98x end-to-end) but omits any description of the experimental methodology, chosen baselines, hardware platform details, or error bars, which is critical for assessing the validity of the speedups claimed for the proposed design.

    Authors: We agree that space constraints in the abstract limit inclusion of full methodology details. The Evaluation section of the manuscript fully specifies the experimental setup, including the multi-GPU hardware platform with NVLink interconnects, the chosen state-of-the-art baselines for MoE systems, and performance results reported with variability across repeated runs. To address the concern, we will revise the abstract to include a concise reference to the evaluation methodology and hardware platform. revision: yes

  2. Referee: The assertion that 'lightweight hardware in the GPU hub' can transparently resolve logical destinations to physical addresses and orchestrate data flows without software mediation or performance overheads lacks supporting details such as microarchitectural specifications, estimated area and latency costs, or compatibility analysis with current GPU memory models and interconnects like NVLink.

    Authors: The manuscript describes the MoE-Hub hardware-software co-design at the architectural level, emphasizing how the GPU hub transparently handles address resolution and data-flow orchestration to enable destination-agnostic transmission. We acknowledge that the current version does not include detailed microarchitectural specifications, quantitative area/latency estimates, or exhaustive compatibility analysis. We will expand the design section with additional qualitative discussion of compatibility with existing GPU memory models and NVLink, while noting that full hardware implementation metrics are left for future work. revision: partial

standing simulated objections not resolved
  • Quantitative microarchitectural specifications, area, and latency cost estimates for the proposed lightweight hardware, as these require detailed hardware modeling and synthesis beyond the scope of the current manuscript.

Circularity Check

0 steps flagged

No circularity in MoE-Hub hardware-software co-design proposal

full rationale

The paper presents a systems design for decoupling data transmission from address management in MoE communication via a new GPU hub hardware component. No mathematical derivations, equations, fitted parameters, or predictions are present in the abstract or described claims. The central argument rests on identifying an abstraction mismatch and proposing a transparent hardware solution, without any self-referential definitions, self-citation load-bearing steps, or renaming of known results. The evaluation speedups are reported outcomes rather than constructed predictions. This is a standard non-circular design proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that the identified abstraction mismatch is the dominant limiter and that the proposed lightweight GPU hub hardware is feasible to implement.

axioms (1)
  • domain assumption Modern GPUs rely on a static, address-centric communication model that conflicts with MoE's dynamic token-to-expert mapping.
    Explicitly identified in the abstract as the root cause of limited scalability.
invented entities (2)
  • Destination-agnostic communication paradigm · no independent evidence
    purpose: Decouples data transmission from address management so producers can send immediately using only a logical destination
    New paradigm introduced by the paper
  • Lightweight hardware in the GPU hub · no independent evidence
    purpose: Transparently handles address allocation and data-flow orchestration
    Proposed hardware addition without external validation of feasibility or overhead

pith-pipeline@v0.9.0 · 5586 in / 1299 out tokens · 51448 ms · 2026-05-08T04:27:19.757152+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

71 extracted references · 23 canonical work pages · 9 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    M. Abdin, J. Aneja, H. Awadalla, A. Awadallah, A. A. Awan, N. Bach, A. Bahree, A. Bakhtiari, J. Bao, H. Behl, A. Benhaim, M. Bilenko, J. Bjorck, S. Bubeck, M. Cai, Q. Cai, V . Chaudhary, D. Chen, D. Chen, W. Chen, Y .-C. Chen, Y .-L. Chen, H. Cheng, P. Chopra, X. Dai, M. Dixon, R. Eldan, V . Fragoso, J. Gao, M. Gao, M. Gao, A. Garg, A. D. Giorno, A. Goswa...

  2. [2]

    Flashdmoe: Fast distributed moe in a single kernel,

    O. J. Aimuyo, B. Oh, and R. Singh, “Flashdmoe: Fast distributed moe in a single kernel,”arXiv preprint arXiv:2506.04667, 2025

  3. [3]

    Moe training best practices on amd gpus

    AMD, “Moe training best practices on amd gpus.” https://rocm.blogs.amd.com/software-tools-optimization/primus-moe-package/README.html, 2025

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,”arXiv preprint arXiv:2308.12966, vol. 1, no. 2, p. 3, 2023

  5. [5]

    Accelerating MoE Model Inference with Expert Sharding

    O. Balmau, A.-M. Kermarrec, R. Pires, A. L. E. Santo, M. de V os, and M. Vujasinovic, “Accelerating moe model inference with expert sharding,” inProceedings of the 5th Workshop on Machine Learning and Systems, ser. EuroMLSys ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 192–199. [Online]. Available: https://doi.org/10.1145/3721146.3721940

  6. [6]

    Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning,

    C. Chen, X. Li, Q. Zhu, J. Duan, P. Sun, X. Zhang, and C. Yang, “Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024, pp. 178–191

  7. [7]

    Scalable irregular parallelism with gpus: getting cpus out of the way,

    Y . Chen, B. Brock, S. Porumbescu, A. Buluc ¸, K. Yelick, and J. D. Owens, “Scalable irregular parallelism with gpus: getting cpus out of the way,” inProceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC ’22. IEEE Press, 2022

  8. [8]

    Mirage persistent kernel: A compiler and runtime for mega-kernelizing tensor programs,

    X. Cheng, Z. Zhang, Y . Zhou, J. Ji, J. Jiang, Z. Zhao, Z. Xiao, Z. Ye, Y . Huang, R. Lai, H. Jin, B. Hou, M. Wu, Y . Dong, A. Yip, Z. Ye, S. Wang, W. Yang, X. Miao, T. Chen, and Z. Jia, “Mirage persistent kernel: A compiler and runtime for mega-kernelizing tensor programs,”

  9. [9]

    Cheng, Z

    [Online]. Available: https://arxiv.org/abs/2512.22219

  10. [10]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y . Wu, Z. Xie, Y . K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang, “Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models,” 2024. [Online]. Available: https://arxiv.org/abs/2401.06066

  11. [11]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI, A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li, F. Lin, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Yang, H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. L. Cai, J. Liang, J. Guo, J. Ni, J. Li, J. Chen, J. Yuan, J. Qiu, J. Song, K. Dong, K. Gao, K. Guan, L. Wan...

  12. [12]

    Deepseek-v3 technical report,

    DeepSeek-AI, A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. Wang, H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. L. Cai, J. Liang, J. Guo, J. Ni, J. Li, J. Wang, J. Chen, J. Chen, J. Yuan, J...

  13. [13]

    DeepSeek-V3 Technical Report

    [Online]. Available: https://arxiv.org/abs/2412.19437

  14. [14]

    GLaM: Efficient scaling of language models with mixture-of-experts,

    N. Du, Y . Huang, A. M. Dai, S. Tong, D. Lepikhin, Y . Xu, M. Krikun, Y . Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. P. Bosma, Z. Zhou, T. Wang, E. Wang, K. Webster, M. Pellat, K. Robinson, K. Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. Le, Y . Wu, Z. Chen, and C. Cui, “GLaM: Efficient scaling of language models with mixture-of-experts,” inProce...

  15. [15]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,

    W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,”Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022

  16. [16]

    Megablocks: Efficient sparse training with mixture-of-experts,

    T. Gale, D. Narayanan, C. Young, and M. Zaharia, “Megablocks: Efficient sparse training with mixture-of-experts,” Proceedings of Machine Learning and Systems, vol. 5, pp. 288–304, 2023

  17. [17]

    Gmlake: Efficient and transparent gpu memory defragmentation for large-scale dnn training with virtual memory stitching,

    C. Guo, R. Zhang, J. Xu, J. Leng, Z. Liu, Z. Huang, M. Guo, H. Wu, S. Zhao, J. Zhao, and K. Zhang, “Gmlake: Efficient and transparent gpu memory defragmentation for large-scale dnn training with virtual memory stitching,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume...

  18. [18]

    Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models,

    J. He, J. Zhai, T. Antunes, H. Wang, F. Luo, S. Shi, and Q. Li, “Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models,” in Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022, pp. 120–134

  19. [19]

    Efficient and adaptable overlapping for computation and communication via signaling and reordering

    K. Hong, X. Li, M. Liu, Q. Mao, T. Wu, Z. Huang, L. Chen, Z. Wang, Y . Zhang, Z. Zhu, G. Dai, and Y . Wang, “Efficient and adaptable overlapping for computation and communication via signaling and reordering,” 2025. [Online]. Available: https://arxiv.org/abs/2504.19519

  20. [20]

    Toward efficient inference for mixture of experts,

    H. Huang, N. Ardalani, A. Sun, L. Ke, H.-H. S. Lee, S. Bhosale, C.-J. Wu, and B. Lee, “Toward efficient inference for mixture of experts,”Advances in Neural Information Processing Systems, vol. 37, pp. 84 033–84 059, 2024

  21. [21]

    Tutel: Adaptive mixture-of-experts at scale,

    C. Hwang, W. Cui, Y. Xiong, Z. Yang, Z. Liu, H. Hu, Z. Wang, R. Salas, J. Jose, P. Ram, H. Chau, P. Cheng, F. Yang, M. Yang, and Y. Xiong, “Tutel: Adaptive mixture-of-experts at scale,” in Proceedings of Machine Learning and Systems, D. Song, M. Carbin, and T. Chen, Eds., vol. 5. Curran, 2023, pp. 269–287. [Online]. Available: https://proceedings.mlsys...

  22. [22]

    Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference,

    R. Hwang, J. Wei, S. Cao, C. Hwang, X. Tang, T. Cao, and M. Yang, “Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference,” in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 2024, pp. 1018–1031

  23. [23]

    Breaking the computation and communication abstraction barrier in distributed machine learning workloads,

    A. Jangda, J. Huang, G. Liu, A. H. N. Sabet, S. Maleki, Y. Miao, M. Musuvathi, T. Mytkowicz, and O. Saarikivi, “Breaking the computation and communication abstraction barrier in distributed machine learning workloads,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2022, ...

  24. [24]

    Mixtral of Experts

    A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mixtral of experts,” 2024. [Onl...

  25. [25]

    Lancet: Accelerating mixture-of-experts training via whole graph computation-communication overlapping,

    C. Jiang, Y. Tian, Z. Jia, S. Zheng, C. Wu, and Y. Wang, “Lancet: Accelerating mixture-of-experts training via whole graph computation-communication overlapping,” Proceedings of Machine Learning and Systems, vol. 6, pp. 74–86, 2024

  26. [26]

    Minference 1.0: accelerating pre-filling for long-context llms via dynamic sparse attention,

    H. Jiang, Y . Li, C. Zhang, Q. Wu, X. Luo, S. Ahn, Z. Han, A. H. Abdi, D. Li, C.-Y . Lin, Y . Yang, and L. Qiu, “Minference 1.0: accelerating pre-filling for long-context llms via dynamic sparse attention,” inPro- ceedings of the 38th International Conference on Neural Information Processing Systems, ser. NIPS ’24. Red Hook, NY , USA: Curran Associates Inc., 2024

  27. [27]

    A detailed and flexible cycle-accurate network-on-chip simulator,

    N. Jiang, D. U. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D. E. Shaw, J. Kim, and W. J. Dally, “A detailed and flexible cycle-accurate network-on-chip simulator,” in2013 IEEE international symposium on performance analysis of systems and software (ISPASS). IEEE, 2013, pp. 86–96

  28. [28]

    MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production

    C. Jin, Z. Jiang, Z. Bai, Z. Zhong, J. Liu, X. Li, N. Zheng, X. Wang, C. Xie, Q. Huang, W. Heng, Y. Ma, W. Bao, S. Zheng, Y. Peng, H. Lin, X. Liu, X. Jin, and X. Liu, “Megascale-moe: Large-scale communication-efficient training of mixture-of-experts models in production,” 2025. [Online]. Available: https://arxiv.org/abs/2505.11432

  29. [29]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,”arXiv preprint arXiv:2001.08361, 2020

  30. [30]

    Locality-centric data and threadblock management for massive gpus,

    M. Khairy, V. Nikiforov, D. Nellans, and T. G. Rogers, “Locality-centric data and threadblock management for massive gpus,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 1022–1036

  31. [31]

    Accel-sim: An extensible simulation framework for validated gpu modeling,

    M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, “Accel-sim: An extensible simulation framework for validated gpu modeling,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 473–486

  32. [32]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” inProceedings of the 29th Symposium on Operating Systems Principles, ser. SOSP ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 611–626. [Online]. Availabl...

  33. [33]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    D. Lepikhin, H. Lee, Y . Xu, D. Chen, O. Firat, Y . Huang, M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with conditional computation and automatic sharding,”arXiv preprint arXiv:2006.16668, 2020

  34. [34]

    Accelerating distributed MoE training and inference with lina,

    J. Li, Y . Jiang, Y . Zhu, C. Wang, and H. Xu, “Accelerating distributed MoE training and inference with lina,” in2023 USENIX Annual Technical Conference (USENIX ATC 23). Boston, MA: USENIX Association, Jul. 2023, pp. 945–959. [Online]. Available: https://www.usenix.org/conference/atc23/presentation/li-jiamin

  35. [35]

    Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache,

    B. Lin, C. Zhang, T. Peng, H. Zhao, W. Xiao, M. Sun, A. Liu, Z. Zhang, L. Li, X. Qiu, S. Li, Z. Ji, T. Xie, Y . Li, and W. Lin, “Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache,” 2024. [Online]. Available: https://arxiv.org/abs/2401.02669

  36. [36]

    Janus: A unified distributed training framework for sparse mixture-of-experts models,

    J. Liu, J. H. Wang, and Y . Jiang, “Janus: A unified distributed training framework for sparse mixture-of-experts models,” inProceedings of the ACM SIGCOMM 2023 Conference, ser. ACM SIGCOMM ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 486–498. [Online]. Available: https://doi.org/10.1145/3603269.3604869

  37. [37]

    Moba: Mixture of block attention for long-context llms,

    E. Lu, Z. Jiang, J. Liu, Y . Du, T. Jiang, C. Hong, S. Liu, W. He, E. Yuan, Y . Wang, Z. Huang, H. Yuan, S. Xu, X. Xu, G. Lai, Y . Chen, H. Zheng, J. Yan, J. Su, Y . Wu, N. Y . Zhang, Z. Yang, X. Zhou, M. Zhang, and J. Qiu, “Moba: Mixture of block attention for long-context llms,” 2025. [Online]. Available: https://arxiv.org/abs/2502.13189

  38. [38]

    The llama 4 herd: The beginning of a new era of natively multimodal ai innovation,

    Meta AI, “The llama 4 herd: The beginning of a new era of natively multimodal ai innovation,” https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025

  39. [39]

    Finepack: Transparently improving the efficiency of fine-grained transfers in multi-gpu systems,

    H. Muthukrishnan, D. Lustig, O. Villa, T. Wenisch, and D. Nellans, “Finepack: Transparently improving the efficiency of fine-grained transfers in multi-gpu systems,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2023, pp. 516–529

  40. [40]

    Efficient multi-gpu shared memory via automatic optimization of fine-grained transfers,

    H. Muthukrishnan, D. Nellans, D. Lustig, J. A. Fessler, and T. F. Wenisch, “Efficient multi-gpu shared memory via automatic optimization of fine-grained transfers,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 139–152

  41. [41]

    Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement

    X. Nie, X. Miao, Z. Wang, Z. Yang, J. Xue, L. Ma, G. Cao, and B. Cui, “Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement,”Proc. ACM Manag. Data, vol. 1, no. 1, May 2023. [Online]. Available: https://doi.org/10.1145/3588964

  42. [42]

    Peer-to-peer & unified virtual addressing

    NVIDIA, “Peer-to-peer & unified virtual addressing.” https://developer.download.nvidia.com/CUDA/training/cuda webinars GPUDirect uva.pdf, 2011

  43. [43]

    Doubling all2all performance with nvidia collective communication library 2.12,

    ——, “Doubling all2all performance with nvidia collective communication library 2.12,” https://developer.nvidia.com/blog/doubling-all2all-performance-with-nvidia-collective-communication-library-2-12/, 2022

  44. [44]

    Applying mixture of experts in llm architectures

    ——, “Applying mixture of experts in llm architectures.” https://developer.nvidia.com/blog/applying-mixture-of-experts-in-llm- architectures/, 2024

  45. [45]

    Cuda templates for linear algebra subroutines

    ——, “Cuda templates for linear algebra subroutines.” https://github.com/NVIDIA/cutlass, 2024

  46. [46]

    Introduction to nvidia dgx h100/h200 systems

    ——, “Introduction to nvidia dgx h100/h200 systems.” https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to- dgxh100.html, 2024

  47. [47]

    New llm: Snowflake arctic model for sql and code generation

    ——, “New llm: Snowflake arctic model for sql and code generation.” https://developer.nvidia.com/blog/new-llm-snowflake-arctic-model-for-sql-and-code-generation/, 2024

  48. [48]

    Mixture of experts powers the most intelligent frontier ai models, runs 10x faster to deliver 1/10 the token cost on nvidia blackwell nvl72

    ——, “Mixture of experts powers the most intelligent frontier ai models, runs 10x faster to deliver 1/10 the token cost on nvidia blackwell nvl72.” https://blogs.nvidia.com/blog/mixture-of-experts-frontier-models/, 2025

  49. [49]

    Nvidia openshmem library (nvshmem) documentation

    ——, “Nvidia openshmem library (nvshmem) documentation.” https://docs.nvidia.com/nvshmem/api/index.html, 2025

  50. [50]

    Transformer engine

    ——, “Transformer engine.”https://github.com/NVIDIA/TransformerEngine, 2025

  51. [51]

    Gpt-oss

    OpenAI, “Gpt-oss.”https://github.com/openai/gpt-oss, 2025

  52. [52]

    Introducing gpt-5

    ——, “Introducing gpt-5.”https://openai.com/index/introducing-gpt-5, 2025

  53. [53]

    T3: Transparent tracking & triggering for fine-grained overlap of compute & collectives,

    S. Pati, S. Aga, M. Islam, N. Jayasena, and M. D. Sinclair, “T3: Transparent tracking & triggering for fine-grained overlap of compute & collectives,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 1146–1164

  54. [54]

    Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale,

    S. Rajbhandari, C. Li, Z. Yao, M. Zhang, R. Y. Aminabadi, A. A. Awan, J. Rasley, and Y. He, “Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale,” in International conference on machine learning. PMLR, 2022, pp. 18332–18346

  55. [55]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters,

    J. Rasley, S. Rajbhandari, O. Ruwase, and Y . He, “Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters,” inProceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, 2020, pp. 3505– 3506

  56. [56]

    Introducing dbrx: A new state-of-the-art open llm,

    M. Research, “Introducing dbrx: A new state-of-the-art open llm,” https://www.databricks.com/blog/introducing-dbrx-new-state-art-open- llm, 2024

  57. [57]

    Scaling vision with sparse mixture of experts,

    C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. Susano Pinto, D. Keysers, and N. Houlsby, “Scaling vision with sparse mixture of experts,” Advances in Neural Information Processing Systems, vol. 34, pp. 8583–8595, 2021

  58. [58]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,

    N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” 2017

  59. [59]

    Pipemoe: Accelerating mixture-of-experts through adaptive pipelining,

    S. Shi, X. Pan, X. Chu, and B. Li, “Pipemoe: Accelerating mixture-of-experts through adaptive pipelining,” in IEEE INFOCOM 2023-IEEE Conference on Computer Communications. IEEE, 2023, pp. 1–10

  60. [60]

    Schemoe: An extensible mixture-of-experts distributed training system with tasks scheduling,

    S. Shi, X. Pan, Q. Wang, C. Liu, X. Ren, Z. Hu, Y . Yang, B. Li, and X. Chu, “Schemoe: An extensible mixture-of-experts distributed training system with tasks scheduling,” inProceedings of the Nineteenth European Conference on Computer Systems, 2024, pp. 236–249

  61. [61]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catan- zaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,”arXiv preprint arXiv:1909.08053, 2019

  62. [62]

    Pangu ultra moe: How to train your big moe on ascend npus,

    Y . Tang, Y . Yin, Y . Wang, H. Zhou, Y . Pan, W. Guo, Z. Zhang, M. Rang, F. Liu, N. Zhang, B. Li, Y . Dong, X. Meng, Y . Wang, D. Li, Y . Li, D. Tu, C. Chen, Y . Yan, F. Yu, R. Tang, Y . Wang, B. Huang, B. Wang, B. Liu, C. Zhang, D. Kuang, F. Liu, G. Huang, J. Wei, J. Qin, J. Ran, J. Li, J. Zhao, L. Dai, L. Li, L. Deng, P. Qin, P. Zeng, Q. Gu, S. Tang, S...

  63. [63]

    Harnessing inter-gpu shared memory for seamless moe communication-computation fusion,

    H. Wang, Y . Xia, D. Yang, X. Zhou, and D. Cheng, “Harnessing inter-gpu shared memory for seamless moe communication-computation fusion,” inProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2025, pp. 170–182

  64. [64]

    Overlap communication with dependent computation via decomposition in large deep learning models,

    S. Wang, J. Wei, A. Sabne, A. Davis, B. Ilbeyi, B. Hechtman, D. Chen, K. S. Murthy, M. Maggioni, Q. Zhang, S. Kumar, T. Guo, Y . Xu, and Z. Zhou, “Overlap communication with dependent computation via decomposition in large deep learning models,” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and O...

  65. [65]

    Prophet: Fine-grained load balancing for parallel training of large-scale moe models,

    W. Wang, Z. Lai, S. Li, W. Liu, K. Ge, Y. Liu, A. Shen, and D. Li, “Prophet: Fine-grained load balancing for parallel training of large-scale moe models,” in 2023 IEEE International Conference on Cluster Computing (CLUSTER), 2023, pp. 82–94

  66. [66]

    Xattention: Block sparse attention with antidiagonal scoring,

    R. Xu, G. Xiao, H. Huang, J. Guo, and S. Han, “Xattention: Block sparse attention with antidiagonal scoring,” 2025. [Online]. Available: https://arxiv.org/abs/2503.16428

  67. [67]

    Moesys: A distributed and efficient mixture-of-experts training and inference system for internet services,

    D. Yu, L. Shen, H. Hao, W. Gong, H. Wu, J. Bian, L. Dai, and H. Xiong, “Moesys: A distributed and efficient mixture-of-experts training and inference system for internet services,”IEEE Transactions on Services Computing, vol. 17, no. 5, pp. 2626–2639, 2024

  68. [68]

    SmartMoE: Efficiently training Sparsely-Activated models through combining offline and online parallelization,

    M. Zhai, J. He, Z. Ma, Z. Zong, R. Zhang, and J. Zhai, “SmartMoE: Efficiently training Sparsely-Activated models through combining offline and online parallelization,” in2023 USENIX Annual Technical Conference (USENIX ATC 23). Boston, MA: USENIX Association, Jul. 2023, pp. 961–975. [Online]. Available: https://www.usenix.org/conference/atc23/presentation/zhai

  69. [69]

    Comet: Fine-grained computation-communication overlapping for mixture-of-experts,

    S. Zhang, N. Zheng, H. Lin, Z. Jiang, W. Bao, C. Jiang, Q. Hou, W. Cui, S. Zheng, L.-W. Chang, Q. Chen, and X. Liu, “Comet: Fine-grained computation-communication overlapping for mixture-of-experts,” inProceedings of Machine Learning and Systems, M. Zaharia, G. Joshi, and Y . Lin, Eds., vol. 7. MLSys, 2025. [Online]. Available: https://proceedings.mlsys.o...

  70. [70]

    Mpmoe: Memory efficient moe for pre-trained models with adaptive pipeline parallelism,

    Z. Zhang, Y . Xia, H. Wang, D. Yang, C. Hu, X. Zhou, and D. Cheng, “Mpmoe: Memory efficient moe for pre-trained models with adaptive pipeline parallelism,”IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 6, pp. 998–1011, 2024

  71. [71]

    Deepep: an efficient expert-parallel communication library,

    C. Zhao, S. Zhou, L. Zhang, C. Deng, Z. Xu, Y. Liu, K. Yu, J. Li, and L. Zhao, “Deepep: an efficient expert-parallel communication library,” https://github.com/deepseek-ai/DeepEP, 2025