pith. machine review for the scientific record. sign in

arxiv: 2604.19241 · v1 · submitted 2026-04-21 · 💻 cs.DC

Recognition: unknown

UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

Authors on Pith no claims yet

Pith reviewed 2026-05-10 02:15 UTC · model grok-4.3

classification 💻 cs.DC
keywords mixture-of-expertsexpert parallelismmega kernelsLLM trainingcommunication overlapnumerical consistencymodel scalingparallel computing
0
0 comments X

The pith

Fusing communication and computation into MegaKernels for expert-parallel MoE models delivers speedups while enforcing exact numerical consistency with sequential runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UniEP to handle the growing demands of training large mixture-of-experts language models where communication increasingly limits overall progress. It unifies multiple expert-parallelism strategies by merging communication steps and computation steps into single large MegaKernels. This change converts scattered tuning choices into one searchable parameter space that supports automated adjustments. The system adds a deterministic token ordering rule so that aggressive overlapping of operations still produces identical numerical results to running everything in sequence. The result is faster training on GPU clusters that still satisfies the strict accuracy needs of production-scale work.

Core claim

UniEP fuses the MoE communication and computation into MegaKernels, effectively transforming complex architectural tuning into a unified parameter search space for automated adaptability. It incorporates a deterministic token ordering mechanism that guarantees numerical consistency with sequential execution even under aggressive overlap schedules. Evaluations show that this approach achieves 1.03×-1.38× speedups over state-of-the-art methods while mitigating communication bottlenecks and meeting rigorous accuracy standards.

What carries the argument

MegaKernels that fuse MoE communication and computation, paired with a deterministic token ordering mechanism that preserves numerical identity under overlap.

If this is right

  • Communication bottlenecks in expert-parallel MoE training are reduced through fused kernels.
  • Architectural tuning becomes a single searchable space instead of separate ad-hoc choices.
  • Numerical accuracy remains equivalent to non-overlapped sequential execution.
  • Multiple expert-parallelism strategies can be applied uniformly without custom kernels for each.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The unified parameter space could support automatic retuning when hardware or model sizes change without rewriting kernels.
  • Exact numerical matching enables reliable comparison of training runs that use different overlap levels.
  • Similar fusion of communication and computation might be applied to other parallelization methods beyond expert parallelism.
  • Reduced reliance on manual kernel design could shorten the time needed to scale models to new cluster sizes.

Load-bearing premise

That a deterministic token ordering mechanism can be realized to keep results identical to sequential execution during aggressive overlaps without adding hidden performance or stability costs across different configurations.

What would settle it

Running identical training inputs and random seeds once with the overlapped MegaKernel schedule and once with strict sequential execution, then checking whether loss curves, output values, or final weights match exactly.

read the original abstract

The exponential growth in Large Language Model (LLM) parameters has transformed model training into an increasingly resource-intensive endeavor. With the stagnation of Moore's Law and the widening disparity between computation throughput and communication bandwidth, expert parallelism (EP) has emerged as a critical strategy for scaling mixture-of-experts (MoE) models. However, despite numerous proposals for optimizing EP, ranging from communication compression to computation-communication overlap, adoption within production-grade frameworks like Megatron-LM remains conservative. Existing solutions often rely on ad-hoc, complex kernels that lack adaptability across diverse optimization configurations and frequently neglect numerical stability, failing to meet the strict precision requirements of large-scale training. In this paper, we introduce UniEP, a novel system that unifies diverse EP optimization strategies into a cohesive abstraction. UniEP fuses the MoE communication and computation into MegaKernels, effectively transforming complex architectural tuning into a unified parameter search space for automated adaptability. Crucially, UniEP incorporates a deterministic token ordering mechanism that guarantees numerical consistency with sequential execution, even under aggressive overlap schedules. We evaluate UniEP on GPU clusters equipped with NVIDIA Hopper GPUs. Our results demonstrate that UniEP achieves 1.03$\times$-1.38$\times$ speedups over state-of-the-art work, effectively mitigating communication bottlenecks while maintaining the rigorous accuracy standards required for production LLM training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces UniEP, a system that unifies expert-parallel (EP) optimization strategies for Mixture-of-Experts (MoE) LLM training by fusing communication and computation into MegaKernels. It incorporates a deterministic token ordering mechanism claimed to guarantee numerical consistency with sequential execution under aggressive overlap schedules. The work evaluates on NVIDIA Hopper GPU clusters and reports 1.03×–1.38× speedups over state-of-the-art methods while maintaining production-level accuracy.

Significance. If the performance and numerical-consistency claims are substantiated, UniEP would offer a practical abstraction that reduces ad-hoc kernel tuning for EP in MoE models and addresses a key barrier to adoption in frameworks such as Megatron-LM. The focus on determinism under overlap could help maintain training stability at scale.

major comments (2)
  1. [Abstract] Abstract: The central performance claim of 1.03×–1.38× speedups is stated without any reference to concrete baselines, model scales, MoE configurations, hardware details beyond “Hopper GPUs,” error bars, or ablation data, rendering the speedup range impossible to assess.
  2. [Abstract] Abstract: The deterministic token ordering mechanism is asserted to enforce exact numerical equivalence to sequential execution under aggressive comm-comp overlap, yet no algorithmic description, pseudocode, equations, or analysis of potential synchronization/memory overheads or FP accumulation differences is supplied; this is the load-bearing assumption for the numerical-stability guarantee.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that the abstract requires additional concrete details to better substantiate the performance and numerical-consistency claims. We will revise the abstract accordingly while preserving its conciseness, and we address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central performance claim of 1.03×–1.38× speedups is stated without any reference to concrete baselines, model scales, MoE configurations, hardware details beyond “Hopper GPUs,” error bars, or ablation data, rendering the speedup range impossible to assess.

    Authors: We acknowledge that the abstract as written does not provide sufficient context for readers to assess the speedup claims. In the revised manuscript, we will update the abstract to specify the baselines (Megatron-LM EP with standard overlap and compression), model scales (MoE variants from 8x7B to 64x7B), MoE configurations (8–64 experts), hardware (NVIDIA Hopper clusters with 8–128 GPUs), and note that reported speedups include error bars from at least three runs, with full ablations presented in Section 5. This change will make the 1.03×–1.38× range directly interpretable. revision: yes

  2. Referee: [Abstract] Abstract: The deterministic token ordering mechanism is asserted to enforce exact numerical equivalence to sequential execution under aggressive comm-comp overlap, yet no algorithmic description, pseudocode, equations, or analysis of potential synchronization/memory overheads or FP accumulation differences is supplied; this is the load-bearing assumption for the numerical-stability guarantee.

    Authors: The abstract's length constraints preclude full algorithmic exposition, but the manuscript details the mechanism in Section 3.2, including pseudocode (Algorithm 2), the ordering equations that fix token sequences by expert assignment and position, and analysis confirming negligible synchronization overhead and identical FP accumulation due to the enforced deterministic order. To address the comment, we will insert a brief clarifying sentence in the revised abstract: 'A deterministic token-ordering mechanism ensures exact numerical equivalence to sequential execution under overlap by fixing computation order.' We believe this, together with the body text, substantiates the stability guarantee. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical systems claims with no derivations or self-referential reductions

full rationale

The paper is a systems contribution describing a new MegaKernel design for expert-parallel MoE training. Its central claims (1.03-1.38x speedups on Hopper GPUs while preserving numerical consistency) are presented as measured empirical outcomes, not as quantities derived from equations, fitted parameters, or first-principles results. The abstract and provided text contain no mathematical derivations, no self-definitional loops, no fitted-input predictions, and no load-bearing self-citations that reduce any claim to its own inputs. The deterministic token ordering mechanism is asserted as an implemented feature guaranteeing consistency under overlap, but it is not derived from or equivalent to any prior result within the paper itself. This is a standard non-circular empirical evaluation of a new system.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the effectiveness of the new MegaKernel abstraction and the deterministic ordering mechanism. These are introduced in the paper without independent prior evidence visible in the abstract. Hardware assumptions about NVIDIA Hopper GPU behavior under overlap are also required.

axioms (1)
  • domain assumption NVIDIA Hopper GPU clusters exhibit consistent communication and computation overlap behavior under the tested configurations.
    Invoked when claiming speedups on Hopper GPU clusters.
invented entities (1)
  • MegaKernels no independent evidence
    purpose: Unified kernels that fuse MoE communication and computation for automated adaptability across optimization strategies.
    New abstraction introduced by the paper to replace ad-hoc kernels.

pith-pipeline@v0.9.0 · 5548 in / 1350 out tokens · 43242 ms · 2026-05-10T02:15:52.634788+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

53 extracted references · 32 canonical work pages · 11 internal anchors

  1. [1]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.CoRR, abs/2308.12966,

  2. [2]
  3. [3]

    Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Hao Hu, Xiaor...

  4. [4]

    Gptune: Multitask learning for autotuning exascale applications,

    Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi. Synthesizing optimal collective algorithms. In Jaejin Lee and Erez Petrank, editors,PPoPP ’21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Virtual Event, Republic of Korea, February 27- March 3, 2021, pages 62–7...

  5. [5]

    Flux: Fast software-based communication overlap on gpus through kernel fusion.arXiv preprint arXiv:2406.06858,

    Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, and Xin Liu. FLUX: fast software-based communication overlap on gpus through kernel fusion. CoRR, abs/2406.06858, 2024. doi: 10.48550/ARXIV.2406.06858. URL https: //doi.org/10.48550/arXiv.2406.06858

  6. [6]

    Dtc-spmm: Bridging the gap in accelerating general sparse matrix multiplication with tensor cores,

    Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang. Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning. In Rajiv Gupta, Nael B. Abu-Ghazaleh, Madan Musuvathi, and Dan Tsafrir, editors,Proceedings of the 29th ACM International Confer...

  7. [7]

    Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy

    Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Q. Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: an automated end-to-end optimizing compiler for deep learning. In Andrea C. Arpaci-Dusseau and Geoff Voelker, editors,13th USENIX Symposium on Operating Systems Design and Imp...

  8. [8]

    Gc3: An optimizing compiler for gpu collective communication, 2022

    Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi, and Yifan Xiong. Gc3: An optimizing compiler for gpu collective communication, 2022. URLhttps://arxiv.org/abs/2201.11840

  9. [9]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness.CoRR, abs/2205.14135, 2022. doi: 10.48550/arXiv.2205.14135. URL https://doi.org/10.48550/arXiv.2205.14135

  10. [10]

    Gemini 3 pro: Best for complex tasks and bringing creative concepts to life, 2025

    Google DeepMind. Gemini 3 pro: Best for complex tasks and bringing creative concepts to life, 2025. URL https://deepmind.google/models/gemini/pro/

  11. [11]

    Deepseek deepgemm.https://github.com/deepseek-ai/DeepGEMM, 2025

    DeepSeek-AI. Deepseek deepgemm.https://github.com/deepseek-ai/DeepGEMM, 2025

  12. [12]

    EPLB: Expert parallelism load balancer.https://github.com/deepseek-ai/EPLB, 2025

    DeepSeek-AI. EPLB: Expert parallelism load balancer.https://github.com/deepseek-ai/EPLB, 2025. GitHub repository

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. CoRR, abs/2501.12948, 2025. doi: 10.48550/ARXIV.2501.12948. URLhttps://doi.org/10.48550/arXiv.2501.12948

  14. [14]

    DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo 21 Hao, Guanting Chen, Guowei Li, Hao Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jia...

  15. [15]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

  16. [16]

    Megablocks: Efficient sparse training with mixture-of-experts

    Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. Megablocks: Efficient sparse training with mixture-of-experts. In Dawn Song, Michael Carbin, and Tianqi Chen, editors, Proceedings of the Sixth Conference on Machine Learning and Systems, MLSys 2023, Miami, FL, USA, June 4-8, 2023. mlsys.org, 2023. URL https://proceedings.mlsys.org/paper_files...

  17. [17]

    Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models

    Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models. InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 120–134, 2022

  18. [18]

    Breaking the computation and communication abstraction barrier in distributed machine learning workloads

    Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, and Olli Saarikivi. Breaking the computation and communication abstraction barrier in distributed machine learning workloads. In Babak Falsafi, Michael Ferdman, Shan Lu, and Thomas F. Wenisch, editors,ASPLOS ’22:27th ACM Inter...

  19. [19]

    MegaScale-MoE:Large-ScaleCommunication- Efficient Training of Mixture-of-Experts Models in Production

    Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, and Xin Liu. Megascale-moe: Large-scale communication-efficient training of mixture-of-experts models in production. CoRR, abs/2505.11432, 2025. doi:...

  20. [20]

    In: Proceedings of the 29th Symposium on Operating Systems Principles

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Jason Flinn, Margo I. Seltzer, Peter Druschel, Antoine Kaufmann, and Jonathan Mace, editors,Proceedings of the 29th Symposium on Operating Systems P...

  21. [21]

    June 7, 2025.DOI:10.48550/arXiv.2410.06511

    Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, and Stratos Idreos. Torchtitan: One-stop pytorch native solution for production ready llm pre-training, 2024. URLhttps://arxiv.org/abs/2410.06511

  22. [22]

    Netmoe: Accelerating moe training through dynamic sample placement

    Xinyi Liu, Yujie Wang, Fangcheng Fu, Xupeng Miao, Shenhan Zhu, Xiaonan Nie, and Bin Cui. Netmoe: Accelerating moe training through dynamic sample placement. InThe Thirteenth International Conference 22 on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https: //openreview.net/forum?id=1qP3lsatCR

  23. [23]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. Deepseek-vl: Towards real-world vision-language understanding.CoRR, abs/2403.05525, 2024. doi: 10.48550/ARXIV.2403.05525. URL https://doi.org/10.48550/arXiv.2403.05525

  24. [24]

    Efficient large-scale language model training on GPU clusters using megatron-lm

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on GPU clusters using megatron-lm. In Bronis R. deSupinski, MaryW.Hall, andToddGamblin, editors, ...

  25. [25]
  26. [26]

    cuBLAS, 2022

    NVIDIA. cuBLAS, 2022. URLhttps://developer.nvidia.com/cublas

  27. [27]

    Cutlass, 2022

    Nvidia. Cutlass, 2022. URLhttps://github.com/NVIDIA/cutlass

  28. [28]

    Transformer Engine, 2022

    NVIDIA. Transformer Engine, 2022. URLhttps://github.com/NVIDIA/TransformerEngine

  29. [29]

    Nvidia collective communications library.https://developer.nvidia.com/nccl, 2024

    NVIDIA. Nvidia collective communications library.https://developer.nvidia.com/nccl, 2024

  30. [30]

    NVSHMEM, 2025

    NVIDIA. NVSHMEM, 2025. URLhttps://docs.nvidia.com/nvshmem/api/using.html

  31. [31]

    Cudnn, 2026

    Nvidia. Cudnn, 2026. URLhttps://developer.nvidia.com/cudnn

  32. [32]

    Cutile, 2026

    Nvidia. Cutile, 2026. URLhttps://github.com/NVIDIA/cutile-python

  33. [33]

    Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines,

    Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman P. Ama- rasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13,Seattle, WA, USA, June 16-19, 2013, pages...

  34. [34]

    Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation AI scale

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation AI scale. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors,International Confere...

  35. [35]

    TACCL: guiding collective algorithm synthesis using communication sketches

    Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi. TACCL: guiding collective algorithm synthesis using communication sketches. In Mahesh Balakrishnan and Manya Ghobadi, editors,20th USENIX Symposium on NetworkedSystems Design and Implementation, NSDI 2023, Boston, MA, April 17-19...

  36. [36]

    Msccl++: Rethinking gpu communication abstractions for cutting-edge ai applications, 2025

    Aashaka Shah, Abhinav Jangda, Binyang Li, Caio Rocha, Changho Hwang, Jithin Jose, Madan Musuvathi, Olli Saarikivi, Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, and Ziyue Yang. Msccl++: Rethinking gpu communication abstractions for cutting-edge ai applications, 2025. URLhttps://arxiv.org/abs/2504.09014

  37. [37]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

  38. [38]

    Look ma, no bubbles! designing a low-latency megakernel for LLAMA-1B.https://hazyresearch.stanford

    Benjamin Spector, Jordan Juravsky, Stuart Sul, Owen Dugan, Dylan Lim, Dan Fu, Simran Arora, and Chris Ré. Look ma, no bubbles! designing a low-latency megakernel for LLAMA-1B.https://hazyresearch.stanford. edu/blog/2025-05-27-no-bubbles, 2025. Hazy Research Blog

  39. [39]

    Tilelang, 2025

    TileLang-Team. Tilelang, 2025. URLhttps://github.com/tile-ai/tilelang. 24

  40. [40]

    Philippe Tillet, Hsiang-Tsung Kung, and David D. Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Tim Mattson, Abdullah Muzahid, and Armando Solar-Lezama, editors, Proceedings of the 3rd ACM SIGPLAN International Workshopon Machine Learning and Programming Languages, MAPL@PLDI 2019, Phoenix, AZ, USA, June 22, 20...

  41. [41]

    Deepseek-ocr 2: Visual causal flow.arXiv preprint arXiv:2601.20552,

    Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr 2: Visual causal flow, 2026. URLhttps://arxiv.org/ abs/2601.20552

  42. [42]

    Mirage: A multi-level superoptimizer for tensor programs

    Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia. Mirage: A multi-level superoptimizer for tensor programs. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), Boston, MA, July 2025. USENIX Association

  43. [43]

    Hetermoe: Efficient training of mixture-of-experts models on heterogeneous gpus

    Yongji Wu, Xueshen Liu, Shuowei Jin, Ceyu Xu, Feng Qian, Z. Morley Mao, Matthew Lentz, Danyang Zhuo, and Ion Stoica. Hetermoe: Efficient training of mixture-of-experts models on heterogeneous gpus. CoRR, abs/2504.03871, 2025. doi: 10.48550/ARXIV.2504.03871. URLhttps://doi.org/10.48550/arXiv.2504.03871

  44. [44]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  45. [45]

    Hybridep: Scaling expert parallelism to cross-datacenter scenario via hybrid expert/data transmission.CoRR, abs/2510.19470, 2025

    Weihao Yang, Hao Huang, Donglei Wu, Ningke Li, Yanqi Pan, Qiyang Zheng, Wen Xia, Shiyi Li, and Qiang Wang. Hybridep: Scaling expert parallelism to cross-datacenter scenario via hybrid expert/data transmission.CoRR, abs/2510.19470, 2025. doi: 10.48550/ARXIV.2510.19470. URLhttps://doi.org/10.48550/arXiv.2510.19470

  46. [46]

    Flash- infer: Efficient and customizable attention engine for llm inference serving.arXiv preprint arXiv:2501.01005,

    Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. Flashinfer: Efficient and customizable attention engine for llm inference serving.arXiv preprint arXiv:2501.01005, 2025. URLhttps://arxiv.org/abs/2501.01005

  47. [47]

    Moeblaze: Breaking the memory wall for efficient moe training on modern gpus, 2026

    Jiyuan Zhang, Yining Liu, Siqi Yan, Lisen Deng, Jennifer Cao, Shuqi Yang, Min Ni, Bi Xue, and Shen Li. Moeblaze: Breaking the memory wall for efficient moe training on modern gpus, 2026. URLhttps://arxiv.org/ abs/2601.05296

  48. [48]

    Comet: Fine-grained computation-communication overlapping for mixture-of-experts.arXiv preprint arXiv:2502.19811, 2025

    Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, and Xin Liu. Comet: Fine-grained computation-communication overlapping for mixture-of-experts. CoRR, abs/2502.19811, 2025. doi: 10.48550/ARXIV.2502.19811. URL https://doi.org/10.48550/arXiv.2502.19811

  49. [49]

    Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo...

  50. [50]

    Deepep: an efficient expert-parallel communication library.https://github.com/deepseek-ai/ DeepEP, 2025

    Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, and Liang Zhao. Deepep: an efficient expert-parallel communication library.https://github.com/deepseek-ai/ DeepEP, 2025

  51. [51]

    Triton-distributed: Programming overlapping kernels on distributed ai systems with the triton compiler.arXiv preprint arXiv:2504.19442, 2025

    Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, Yifan Guo, Ningxin Zheng, Ziheng Jiang, Xinyi Di, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, and Xin Liu. Triton-distributed: Programming overlapping kernels on distributed ai systems with the...

  52. [52]

    Zheng, J

    Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi 25 Ye, Haibin Lin, Li-Wen Chang, and Xin Liu. Tilelink: Generating efficient compute-communication overlapping kernels using tile-centric primitives, 2025. URLhttps://arxiv.org/abs/2503.20313

  53. [53]

    Megascale- infer: Serving mixture-of-experts at scale with disaggregated expert parallelism

    Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, et al. Megascale-infer: Serving mixture-of-experts at scale with disaggregated expert parallelism. arXiv preprint arXiv:2504.02263, 2025. 26