arxiv: 2604.19241 · v1 · submitted 2026-04-21 · 💻 cs.DC

Recognition: unknown

UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

Size Zheng , Xuegui Zheng , Li-wen Chang , Jidong Zhai

Authors on Pith no claims yet

Pith reviewed 2026-05-10 02:15 UTC · model grok-4.3

classification 💻 cs.DC

keywords mixture-of-expertsexpert parallelismmega kernelsLLM trainingcommunication overlapnumerical consistencymodel scalingparallel computing

0 comments

The pith

Fusing communication and computation into MegaKernels for expert-parallel MoE models delivers speedups while enforcing exact numerical consistency with sequential runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UniEP to handle the growing demands of training large mixture-of-experts language models where communication increasingly limits overall progress. It unifies multiple expert-parallelism strategies by merging communication steps and computation steps into single large MegaKernels. This change converts scattered tuning choices into one searchable parameter space that supports automated adjustments. The system adds a deterministic token ordering rule so that aggressive overlapping of operations still produces identical numerical results to running everything in sequence. The result is faster training on GPU clusters that still satisfies the strict accuracy needs of production-scale work.

Core claim

UniEP fuses the MoE communication and computation into MegaKernels, effectively transforming complex architectural tuning into a unified parameter search space for automated adaptability. It incorporates a deterministic token ordering mechanism that guarantees numerical consistency with sequential execution even under aggressive overlap schedules. Evaluations show that this approach achieves 1.03×-1.38× speedups over state-of-the-art methods while mitigating communication bottlenecks and meeting rigorous accuracy standards.

What carries the argument

MegaKernels that fuse MoE communication and computation, paired with a deterministic token ordering mechanism that preserves numerical identity under overlap.

If this is right

Communication bottlenecks in expert-parallel MoE training are reduced through fused kernels.
Architectural tuning becomes a single searchable space instead of separate ad-hoc choices.
Numerical accuracy remains equivalent to non-overlapped sequential execution.
Multiple expert-parallelism strategies can be applied uniformly without custom kernels for each.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The unified parameter space could support automatic retuning when hardware or model sizes change without rewriting kernels.
Exact numerical matching enables reliable comparison of training runs that use different overlap levels.
Similar fusion of communication and computation might be applied to other parallelization methods beyond expert parallelism.
Reduced reliance on manual kernel design could shorten the time needed to scale models to new cluster sizes.

Load-bearing premise

That a deterministic token ordering mechanism can be realized to keep results identical to sequential execution during aggressive overlaps without adding hidden performance or stability costs across different configurations.

What would settle it

Running identical training inputs and random seeds once with the overlapped MegaKernel schedule and once with strict sequential execution, then checking whether loss curves, output values, or final weights match exactly.

read the original abstract

The exponential growth in Large Language Model (LLM) parameters has transformed model training into an increasingly resource-intensive endeavor. With the stagnation of Moore's Law and the widening disparity between computation throughput and communication bandwidth, expert parallelism (EP) has emerged as a critical strategy for scaling mixture-of-experts (MoE) models. However, despite numerous proposals for optimizing EP, ranging from communication compression to computation-communication overlap, adoption within production-grade frameworks like Megatron-LM remains conservative. Existing solutions often rely on ad-hoc, complex kernels that lack adaptability across diverse optimization configurations and frequently neglect numerical stability, failing to meet the strict precision requirements of large-scale training. In this paper, we introduce UniEP, a novel system that unifies diverse EP optimization strategies into a cohesive abstraction. UniEP fuses the MoE communication and computation into MegaKernels, effectively transforming complex architectural tuning into a unified parameter search space for automated adaptability. Crucially, UniEP incorporates a deterministic token ordering mechanism that guarantees numerical consistency with sequential execution, even under aggressive overlap schedules. We evaluate UniEP on GPU clusters equipped with NVIDIA Hopper GPUs. Our results demonstrate that UniEP achieves 1.03$\times$-1.38$\times$ speedups over state-of-the-art work, effectively mitigating communication bottlenecks while maintaining the rigorous accuracy standards required for production LLM training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

UniEP bundles EP optimizations into MegaKernels with a deterministic token ordering claim, delivering modest speedups on Hopper but leaving the stability mechanism underspecified.

read the letter

The main point is that this paper unifies several expert-parallelism tricks for MoE training into a single MegaKernel abstraction and adds a deterministic token ordering step to keep numerical results identical to sequential execution even with heavy communication overlap. That combination is presented as the core contribution, aimed at making production frameworks more willing to adopt aggressive optimizations without breaking precision requirements.

Referee Report

2 major / 0 minor

Summary. The paper introduces UniEP, a system that unifies expert-parallel (EP) optimization strategies for Mixture-of-Experts (MoE) LLM training by fusing communication and computation into MegaKernels. It incorporates a deterministic token ordering mechanism claimed to guarantee numerical consistency with sequential execution under aggressive overlap schedules. The work evaluates on NVIDIA Hopper GPU clusters and reports 1.03×–1.38× speedups over state-of-the-art methods while maintaining production-level accuracy.

Significance. If the performance and numerical-consistency claims are substantiated, UniEP would offer a practical abstraction that reduces ad-hoc kernel tuning for EP in MoE models and addresses a key barrier to adoption in frameworks such as Megatron-LM. The focus on determinism under overlap could help maintain training stability at scale.

major comments (2)

[Abstract] Abstract: The central performance claim of 1.03×–1.38× speedups is stated without any reference to concrete baselines, model scales, MoE configurations, hardware details beyond “Hopper GPUs,” error bars, or ablation data, rendering the speedup range impossible to assess.
[Abstract] Abstract: The deterministic token ordering mechanism is asserted to enforce exact numerical equivalence to sequential execution under aggressive comm-comp overlap, yet no algorithmic description, pseudocode, equations, or analysis of potential synchronization/memory overheads or FP accumulation differences is supplied; this is the load-bearing assumption for the numerical-stability guarantee.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that the abstract requires additional concrete details to better substantiate the performance and numerical-consistency claims. We will revise the abstract accordingly while preserving its conciseness, and we address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The central performance claim of 1.03×–1.38× speedups is stated without any reference to concrete baselines, model scales, MoE configurations, hardware details beyond “Hopper GPUs,” error bars, or ablation data, rendering the speedup range impossible to assess.

Authors: We acknowledge that the abstract as written does not provide sufficient context for readers to assess the speedup claims. In the revised manuscript, we will update the abstract to specify the baselines (Megatron-LM EP with standard overlap and compression), model scales (MoE variants from 8x7B to 64x7B), MoE configurations (8–64 experts), hardware (NVIDIA Hopper clusters with 8–128 GPUs), and note that reported speedups include error bars from at least three runs, with full ablations presented in Section 5. This change will make the 1.03×–1.38× range directly interpretable. revision: yes
Referee: [Abstract] Abstract: The deterministic token ordering mechanism is asserted to enforce exact numerical equivalence to sequential execution under aggressive comm-comp overlap, yet no algorithmic description, pseudocode, equations, or analysis of potential synchronization/memory overheads or FP accumulation differences is supplied; this is the load-bearing assumption for the numerical-stability guarantee.

Authors: The abstract's length constraints preclude full algorithmic exposition, but the manuscript details the mechanism in Section 3.2, including pseudocode (Algorithm 2), the ordering equations that fix token sequences by expert assignment and position, and analysis confirming negligible synchronization overhead and identical FP accumulation due to the enforced deterministic order. To address the comment, we will insert a brief clarifying sentence in the revised abstract: 'A deterministic token-ordering mechanism ensures exact numerical equivalence to sequential execution under overlap by fixing computation order.' We believe this, together with the body text, substantiates the stability guarantee. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical systems claims with no derivations or self-referential reductions

full rationale

The paper is a systems contribution describing a new MegaKernel design for expert-parallel MoE training. Its central claims (1.03-1.38x speedups on Hopper GPUs while preserving numerical consistency) are presented as measured empirical outcomes, not as quantities derived from equations, fitted parameters, or first-principles results. The abstract and provided text contain no mathematical derivations, no self-definitional loops, no fitted-input predictions, and no load-bearing self-citations that reduce any claim to its own inputs. The deterministic token ordering mechanism is asserted as an implemented feature guaranteeing consistency under overlap, but it is not derived from or equivalent to any prior result within the paper itself. This is a standard non-circular empirical evaluation of a new system.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the effectiveness of the new MegaKernel abstraction and the deterministic ordering mechanism. These are introduced in the paper without independent prior evidence visible in the abstract. Hardware assumptions about NVIDIA Hopper GPU behavior under overlap are also required.

axioms (1)

domain assumption NVIDIA Hopper GPU clusters exhibit consistent communication and computation overlap behavior under the tested configurations.
Invoked when claiming speedups on Hopper GPU clusters.

invented entities (1)

MegaKernels no independent evidence
purpose: Unified kernels that fuse MoE communication and computation for automated adaptability across optimization strategies.
New abstraction introduced by the paper to replace ad-hoc kernels.

pith-pipeline@v0.9.0 · 5548 in / 1350 out tokens · 43242 ms · 2026-05-10T02:15:52.634788+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 32 canonical work pages · 11 internal anchors

[1]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.CoRR, abs/2308.12966,

work page internal anchor Pith review arXiv
[2]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

doi: 10.48550/ARXIV.2308.12966. URLhttps://doi.org/10.48550/arXiv.2308.12966

work page internal anchor Pith review doi:10.48550/arxiv.2308.12966
[3]

Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Hao Hu, Xiaor...

work page internal anchor Pith review doi:10.48550/arxiv.2507.20534 2025
[4]

Gptune: Multitask learning for autotuning exascale applications,

Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi. Synthesizing optimal collective algorithms. In Jaejin Lee and Erez Petrank, editors,PPoPP ’21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Virtual Event, Republic of Korea, February 27- March 3, 2021, pages 62–7...

work page doi:10.1145/3437801.3441620 2021
[5]

Flux: Fast software-based communication overlap on gpus through kernel fusion.arXiv preprint arXiv:2406.06858,

Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, and Xin Liu. FLUX: fast software-based communication overlap on gpus through kernel fusion. CoRR, abs/2406.06858, 2024. doi: 10.48550/ARXIV.2406.06858. URL https: //doi.org/10.48550/arXiv.2406.06858

work page doi:10.48550/arxiv.2406.06858 2024
[6]

Dtc-spmm: Bridging the gap in accelerating general sparse matrix multiplication with tensor cores,

Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang. Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning. In Rajiv Gupta, Nael B. Abu-Ghazaleh, Madan Musuvathi, and Dan Tsafrir, editors,Proceedings of the 29th ACM International Confer...

work page doi:10.1145/3620666.3651379 2024
[7]

Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Q. Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: an automated end-to-end optimizing compiler for deep learning. In Andrea C. Arpaci-Dusseau and Geoff Voelker, editors,13th USENIX Symposium on Operating Systems Design and Imp...

2018
[8]

Gc3: An optimizing compiler for gpu collective communication, 2022

Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi, and Yifan Xiong. Gc3: An optimizing compiler for gpu collective communication, 2022. URLhttps://arxiv.org/abs/2201.11840

work page arXiv 2022
[9]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness.CoRR, abs/2205.14135, 2022. doi: 10.48550/arXiv.2205.14135. URL https://doi.org/10.48550/arXiv.2205.14135

work page internal anchor Pith review doi:10.48550/arxiv.2205.14135 2022
[10]

Gemini 3 pro: Best for complex tasks and bringing creative concepts to life, 2025

Google DeepMind. Gemini 3 pro: Best for complex tasks and bringing creative concepts to life, 2025. URL https://deepmind.google/models/gemini/pro/

2025
[11]

Deepseek deepgemm.https://github.com/deepseek-ai/DeepGEMM, 2025

DeepSeek-AI. Deepseek deepgemm.https://github.com/deepseek-ai/DeepGEMM, 2025

2025
[12]

EPLB: Expert parallelism load balancer.https://github.com/deepseek-ai/EPLB, 2025

DeepSeek-AI. EPLB: Expert parallelism load balancer.https://github.com/deepseek-ai/EPLB, 2025. GitHub repository

2025
[13]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. CoRR, abs/2501.12948, 2025. doi: 10.48550/ARXIV.2501.12948. URLhttps://doi.org/10.48550/arXiv.2501.12948

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025
[14]

DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo 21 Hao, Guanting Chen, Guowei Li, Hao Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jia...

work page internal anchor Pith review doi:10.48550/arxiv.2405.04434 2024
[15]

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.19437 2024
[16]

Megablocks: Efficient sparse training with mixture-of-experts

Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. Megablocks: Efficient sparse training with mixture-of-experts. In Dawn Song, Michael Carbin, and Tianqi Chen, editors, Proceedings of the Sixth Conference on Machine Learning and Systems, MLSys 2023, Miami, FL, USA, June 4-8, 2023. mlsys.org, 2023. URL https://proceedings.mlsys.org/paper_files...

2023
[17]

Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models

Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models. InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 120–134, 2022

2022
[18]

Breaking the computation and communication abstraction barrier in distributed machine learning workloads

Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, and Olli Saarikivi. Breaking the computation and communication abstraction barrier in distributed machine learning workloads. In Babak Falsafi, Michael Ferdman, Shan Lu, and Thomas F. Wenisch, editors,ASPLOS ’22:27th ACM Inter...

work page doi:10.1145/3503222.3507778 2022
[19]

MegaScale-MoE:Large-ScaleCommunication- Efficient Training of Mixture-of-Experts Models in Production

Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, and Xin Liu. Megascale-moe: Large-scale communication-efficient training of mixture-of-experts models in production. CoRR, abs/2505.11432, 2025. doi:...

work page doi:10.48550/arxiv.2505.11432 2025
[20]

In: Proceedings of the 29th Symposium on Operating Systems Principles

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Jason Flinn, Margo I. Seltzer, Peter Druschel, Antoine Kaufmann, and Jonathan Mace, editors,Proceedings of the 29th Symposium on Operating Systems P...

work page doi:10.1145/3600006.3613165 2023
[21]

June 7, 2025.DOI:10.48550/arXiv.2410.06511

Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, and Stratos Idreos. Torchtitan: One-stop pytorch native solution for production ready llm pre-training, 2024. URLhttps://arxiv.org/abs/2410.06511

work page arXiv 2024
[22]

Netmoe: Accelerating moe training through dynamic sample placement

Xinyi Liu, Yujie Wang, Fangcheng Fu, Xupeng Miao, Shenhan Zhu, Xiaonan Nie, and Bin Cui. Netmoe: Accelerating moe training through dynamic sample placement. InThe Thirteenth International Conference 22 on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https: //openreview.net/forum?id=1qP3lsatCR

2025
[23]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. Deepseek-vl: Towards real-world vision-language understanding.CoRR, abs/2403.05525, 2024. doi: 10.48550/ARXIV.2403.05525. URL https://doi.org/10.48550/arXiv.2403.05525

work page internal anchor Pith review doi:10.48550/arxiv.2403.05525 2024
[24]

Efficient large-scale language model training on GPU clusters using megatron-lm

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on GPU clusters using megatron-lm. In Bronis R. deSupinski, MaryW.Hall, andToddGamblin, editors, ...

2021
[25]

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , articleno =

doi: 10.1145/3458817.3476209. URLhttps://doi.org/10.1145/3458817.3476209

work page doi:10.1145/3458817.3476209
[26]

cuBLAS, 2022

NVIDIA. cuBLAS, 2022. URLhttps://developer.nvidia.com/cublas

2022
[27]

Cutlass, 2022

Nvidia. Cutlass, 2022. URLhttps://github.com/NVIDIA/cutlass

2022
[28]

Transformer Engine, 2022

NVIDIA. Transformer Engine, 2022. URLhttps://github.com/NVIDIA/TransformerEngine

2022
[29]

Nvidia collective communications library.https://developer.nvidia.com/nccl, 2024

NVIDIA. Nvidia collective communications library.https://developer.nvidia.com/nccl, 2024

2024
[30]

NVSHMEM, 2025

NVIDIA. NVSHMEM, 2025. URLhttps://docs.nvidia.com/nvshmem/api/using.html

2025
[31]

Cudnn, 2026

Nvidia. Cudnn, 2026. URLhttps://developer.nvidia.com/cudnn

2026
[32]

Cutile, 2026

Nvidia. Cutile, 2026. URLhttps://github.com/NVIDIA/cutile-python

2026
[33]

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines,

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman P. Ama- rasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13,Seattle, WA, USA, June 16-19, 2013, pages...

work page doi:10.1145/2491956.2462176 2013
[34]

Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation AI scale

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation AI scale. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors,International Confere...

2022
[35]

TACCL: guiding collective algorithm synthesis using communication sketches

Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi. TACCL: guiding collective algorithm synthesis using communication sketches. In Mahesh Balakrishnan and Manya Ghobadi, editors,20th USENIX Symposium on NetworkedSystems Design and Implementation, NSDI 2023, Boston, MA, April 17-19...

2023
[36]

Msccl++: Rethinking gpu communication abstractions for cutting-edge ai applications, 2025

Aashaka Shah, Abhinav Jangda, Binyang Li, Caio Rocha, Changho Hwang, Jithin Jose, Madan Musuvathi, Olli Saarikivi, Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, and Ziyue Yang. Msccl++: Rethinking gpu communication abstractions for cutting-edge ai applications, 2025. URLhttps://arxiv.org/abs/2504.09014

work page arXiv 2025
[37]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Look ma, no bubbles! designing a low-latency megakernel for LLAMA-1B.https://hazyresearch.stanford

Benjamin Spector, Jordan Juravsky, Stuart Sul, Owen Dugan, Dylan Lim, Dan Fu, Simran Arora, and Chris Ré. Look ma, no bubbles! designing a low-latency megakernel for LLAMA-1B.https://hazyresearch.stanford. edu/blog/2025-05-27-no-bubbles, 2025. Hazy Research Blog

2025
[39]

Tilelang, 2025

TileLang-Team. Tilelang, 2025. URLhttps://github.com/tile-ai/tilelang. 24

2025
[40]

Philippe Tillet, Hsiang-Tsung Kung, and David D. Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Tim Mattson, Abdullah Muzahid, and Armando Solar-Lezama, editors, Proceedings of the 3rd ACM SIGPLAN International Workshopon Machine Learning and Programming Languages, MAPL@PLDI 2019, Phoenix, AZ, USA, June 22, 20...

work page doi:10.1145/3315508.3329973 2019
[41]

Deepseek-ocr 2: Visual causal flow.arXiv preprint arXiv:2601.20552,

Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr 2: Visual causal flow, 2026. URLhttps://arxiv.org/ abs/2601.20552

work page arXiv 2026
[42]

Mirage: A multi-level superoptimizer for tensor programs

Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia. Mirage: A multi-level superoptimizer for tensor programs. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), Boston, MA, July 2025. USENIX Association

2025
[43]

Hetermoe: Efficient training of mixture-of-experts models on heterogeneous gpus

Yongji Wu, Xueshen Liu, Shuowei Jin, Ceyu Xu, Feng Qian, Z. Morley Mao, Matthew Lentz, Danyang Zhuo, and Ion Stoica. Hetermoe: Efficient training of mixture-of-experts models on heterogeneous gpus. CoRR, abs/2504.03871, 2025. doi: 10.48550/ARXIV.2504.03871. URLhttps://doi.org/10.48550/arXiv.2504.03871

work page doi:10.48550/arxiv.2504.03871 2025
[44]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025
[45]

Hybridep: Scaling expert parallelism to cross-datacenter scenario via hybrid expert/data transmission.CoRR, abs/2510.19470, 2025

Weihao Yang, Hao Huang, Donglei Wu, Ningke Li, Yanqi Pan, Qiyang Zheng, Wen Xia, Shiyi Li, and Qiang Wang. Hybridep: Scaling expert parallelism to cross-datacenter scenario via hybrid expert/data transmission.CoRR, abs/2510.19470, 2025. doi: 10.48550/ARXIV.2510.19470. URLhttps://doi.org/10.48550/arXiv.2510.19470

work page doi:10.48550/arxiv.2510.19470 2025
[46]

Flash- infer: Efficient and customizable attention engine for llm inference serving.arXiv preprint arXiv:2501.01005,

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. Flashinfer: Efficient and customizable attention engine for llm inference serving.arXiv preprint arXiv:2501.01005, 2025. URLhttps://arxiv.org/abs/2501.01005

work page arXiv 2025
[47]

Moeblaze: Breaking the memory wall for efficient moe training on modern gpus, 2026

Jiyuan Zhang, Yining Liu, Siqi Yan, Lisen Deng, Jennifer Cao, Shuqi Yang, Min Ni, Bi Xue, and Shen Li. Moeblaze: Breaking the memory wall for efficient moe training on modern gpus, 2026. URLhttps://arxiv.org/ abs/2601.05296

work page arXiv 2026
[48]

Comet: Fine-grained computation-communication overlapping for mixture-of-experts.arXiv preprint arXiv:2502.19811, 2025

Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, and Xin Liu. Comet: Fine-grained computation-communication overlapping for mixture-of-experts. CoRR, abs/2502.19811, 2025. doi: 10.48550/ARXIV.2502.19811. URL https://doi.org/10.48550/arXiv.2502.19811

work page doi:10.48550/arxiv.2502.19811 2025
[49]

Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo...

work page internal anchor Pith review doi:10.48550/arxiv.2510.26692 2025
[50]

Deepep: an efficient expert-parallel communication library.https://github.com/deepseek-ai/ DeepEP, 2025

Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, and Liang Zhao. Deepep: an efficient expert-parallel communication library.https://github.com/deepseek-ai/ DeepEP, 2025

2025
[51]

Triton-distributed: Programming overlapping kernels on distributed ai systems with the triton compiler.arXiv preprint arXiv:2504.19442, 2025

Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, Yifan Guo, Ningxin Zheng, Ziheng Jiang, Xinyi Di, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, and Xin Liu. Triton-distributed: Programming overlapping kernels on distributed ai systems with the...

work page arXiv 2025
[52]

Zheng, J

Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi 25 Ye, Haibin Lin, Li-Wen Chang, and Xin Liu. Tilelink: Generating efficient compute-communication overlapping kernels using tile-centric primitives, 2025. URLhttps://arxiv.org/abs/2503.20313

work page arXiv 2025
[53]

Megascale- infer: Serving mixture-of-experts at scale with disaggregated expert parallelism

Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, et al. Megascale-infer: Serving mixture-of-experts at scale with disaggregated expert parallelism. arXiv preprint arXiv:2504.02263, 2025. 26

work page arXiv 2025