Recognition: 2 theorem links
· Lean Theorem · DisagMoE: Computation-Communication Overlapped MoE Training via Disaggregated AF-Pipe Parallelism
Pith reviewed 2026-05-13 01:05 UTC · model grok-4.3
The pith
DisagMoE disaggregates attention and FFN layers to overlap computation with all-to-all communication in MoE training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DisagMoE separates attention and FFN layers into disjoint GPU groups, introduces a multi-stage pipeline with uni-directional many-to-many communications, and uses a computation-communication roofline model to balance resources, thereby achieving up to 1.8x speedup in training efficiency on 16-node 8xH800 clusters.
What carries the argument
The disaggregated AF-Pipe parallelism that assigns attention and FFN to separate GPU groups and schedules uni-directional pipeline communications between them, while using a roofline model to tune bandwidth allocation.
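As a rough illustration of the balance such a roofline model is meant to enforce, consider a simple timing sketch; the notation is ours, not the paper's. Let G be the total GPU count, g_A and g_F the fractions given to the attention and FFN groups, W_A and W_F their per-microbatch FLOP counts, P the per-GPU throughput, V the bytes exchanged between groups per microbatch, and B the inter-group bandwidth.

```latex
% Illustrative per-microbatch stage times under the assumed notation (not from the paper):
\[
  T_A = \frac{W_A}{g_A \, G \, P}, \qquad
  T_F = \frac{W_F}{g_F \, G \, P}, \qquad
  T_C = \frac{V}{B}, \qquad g_A + g_F = 1 .
\]
% A roofline-style allocation picks g_A so that no single stage dominates,
\[
  \max(T_A,\, T_F) \;\approx\; T_C ,
\]
% at which point compute and many-to-many traffic can overlap in steady state.
```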
Load-bearing premise
That the overhead of splitting layers across groups and managing the multi-stage pipeline stays smaller than the communication time saved by the improved overlap.
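One way to make that premise checkable is a toy cost model. Everything below, including the stage costs, the 50% baseline overlap fraction, and the per-stage handoff cost, is an illustrative assumption, not a measurement or the paper's method.

```python
# Toy cost model for the load-bearing premise (illustrative assumptions only).
# Baseline: attention + FFN compute, with a fraction of all-to-all hidden.
# Disaggregated: stages run concurrently; iteration time is bounded by the
# slowest stage plus pipeline fill/drain and per-stage handoff overhead.

def baseline_iter_ms(attn_ms, ffn_ms, a2a_ms, overlap_frac=0.5):
    """Iteration time when only `overlap_frac` of all-to-all is hidden."""
    exposed_a2a = a2a_ms * (1.0 - overlap_frac)
    return attn_ms + ffn_ms + exposed_a2a

def disagg_iter_ms(attn_ms, ffn_ms, a2a_ms, microbatches, handoff_ms):
    """Steady-state pipeline: the slowest stage bounds throughput; fill/drain
    adds roughly (stages - 1) extra stage times; handoff_ms models the cost
    of splitting layers across GPU groups."""
    stage_ms = max(attn_ms, ffn_ms, a2a_ms) / microbatches + handoff_ms
    stages = 3  # attention, communication, FFN
    return stage_ms * (microbatches + stages - 1)

if __name__ == "__main__":
    # Hypothetical per-iteration costs in ms; not taken from the paper.
    attn, ffn, a2a = 40.0, 60.0, 70.0
    base = baseline_iter_ms(attn, ffn, a2a)
    disagg = disagg_iter_ms(attn, ffn, a2a, microbatches=8, handoff_ms=0.5)
    print(f"baseline {base:.1f} ms, disaggregated {disagg:.1f} ms, "
          f"speedup {base / disagg:.2f}x")
    # The premise holds only while handoff + fill/drain overhead stays below
    # the exposed all-to-all time that disaggregation eliminates.
```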
What would settle it
Running the same MoE models on the same 16-node cluster with standard expert parallelism and measuring whether DisagMoE still delivers lower iteration time or whether the speedup disappears.
original abstract
Mixture-of-experts (MoE) architectures enable trillion-parameter LLMs with sparsely activated experts. Expert parallelism (EP) is a widely adopted MoE training strategy, but it suffers from severe all-to-all communication bottlenecks, which is exaggerated by the limited inter-node network bandwidth as the growing model size requires distributing experts across GPU nodes. Prior work focused on overlapping these all-to-all communications with feed-forward network (FFN) and self-attention computations, which often leaves residual network-bound stalls due to inherent imbalance in attention and FFN layers' computation-communication ratios. We present DisagMoE, a disaggregated MoE training system that jointly optimizes model placement and scheduling for maximal efficiency. DisagMoE separates attention and FFN layers into disjoint GPU groups, introduces a multi-stage pipeline with uni-directional, many-to-many communications, and employs a computation-communication roofline model to balance GPU and network bandwidth allocation among the attention and FFN groups. DisagMoE is implemented on Megatron-LM, and evaluation shows that DisagMoE improves training efficiency across multiple MoE models with up to 1.8x speedup on 16-node 8xH800 clusters.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents DisagMoE, a system for efficient MoE training that disaggregates attention and FFN layers into separate GPU groups, introduces a multi-stage uni-directional many-to-many pipeline to overlap computation and communication, and applies a roofline model to allocate GPU and network bandwidth. It reports up to 1.8x speedup over prior overlap methods when evaluated on multiple MoE models using 16-node 8xH800 clusters, implemented on top of Megatron-LM.
Significance. If the reported speedups and overlap claims hold under broader validation, the disaggregation strategy could meaningfully advance scalable training of large MoE models by reducing residual communication stalls that persist in existing attention-FFN overlap techniques. The approach directly targets inter-node all-to-all bottlenecks that grow with model size and limited network bandwidth.
major comments (3)
- [Abstract and Evaluation] The central claim of up to 1.8x speedup is stated without any description of the exact baselines (e.g., which prior overlap methods or configurations), measurement methodology, error bars, or number of runs, preventing assessment of whether the gains are statistically robust or reproducible.
- [Method and Roofline Model] The roofline model is invoked to balance bandwidth allocation between attention and FFN groups, yet no equations, fitted parameters, or empirical validation against measured hardware counters are supplied, leaving open whether it accurately predicts the claimed perfect overlap or merely restates the observed wall-clock times.
- [Evaluation and Scalability] Experiments are confined to 16 nodes; the weakest assumption, that disaggregation plus the uni-directional pipeline introduces no offsetting synchronization, load-imbalance, or pipeline-bubble overheads, is therefore untested at larger scales where attention-to-FFN ratios may fluctuate more severely across layers (a generic bubble-fraction sketch follows this list).
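On the scalability point, a generic pipeline-bubble estimate (standard pipelining analysis, not taken from the paper) gives a sense of the overhead that must stay small as stage counts grow:

```latex
% Generic fill/drain overhead for a pipeline with s stages and m microbatches:
\[
  \text{bubble fraction} \;\approx\; \frac{s - 1}{m + s - 1},
\]
% e.g. a three-stage attention/communication/FFN pipeline with m = 8
% microbatches idles roughly 2/10 = 20% of the time unless stages are
% well balanced and m grows with the cluster.
```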
minor comments (2)
- [Implementation] While integration with Megatron-LM is mentioned, the manuscript would benefit from pseudocode or a diagram clarifying the exact scheduling of the multi-stage uni-directional many-to-many communications (a hedged schedule sketch follows this list).
- [Figures] Performance plots should include per-layer or per-step breakdowns of compute versus communication time to substantiate the claimed elimination of residual stalls.
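In the spirit of the pseudocode request above, here is a minimal sketch of how a disaggregated, uni-directional schedule could be written down. The stage names, queue structure, and step granularity are our illustrative assumptions and do not reflect the paper's actual Megatron-LM implementation.

```python
# Illustrative multi-stage schedule for a disaggregated attention/FFN pipeline.
# Each microbatch flows one way per phase: attention group -> dispatch ->
# FFN group -> combine -> attention group (next layer). Not the paper's code.
from collections import deque

def schedule(num_microbatches, num_layers):
    """Yield (step, group, action, microbatch, layer) events for a toy
    steady-state schedule in which communication is its own stage."""
    attn_q = deque((mb, 0) for mb in range(num_microbatches))
    dispatch_q, ffn_q, combine_q = deque(), deque(), deque()
    step = 0
    while attn_q or dispatch_q or ffn_q or combine_q:
        step += 1
        # All four stages advance one item within a step, so compute
        # (attention/FFN) overlaps with communication (dispatch/combine).
        if combine_q:
            mb, layer = combine_q.popleft()
            yield step, "net", "combine", mb, layer
            if layer + 1 < num_layers:
                attn_q.append((mb, layer + 1))
        if ffn_q:
            mb, layer = ffn_q.popleft()
            yield step, "ffn-group", "expert_ffn", mb, layer
            combine_q.append((mb, layer))
        if dispatch_q:
            mb, layer = dispatch_q.popleft()
            yield step, "net", "dispatch", mb, layer
            ffn_q.append((mb, layer))
        if attn_q:
            mb, layer = attn_q.popleft()
            yield step, "attn-group", "attention", mb, layer
            dispatch_q.append((mb, layer))

if __name__ == "__main__":
    for event in schedule(num_microbatches=4, num_layers=2):
        print(event)
```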
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our paper. We address each of the major comments below and will incorporate revisions to strengthen the manuscript's clarity and completeness.
point-by-point responses
-
Referee: [Abstract and Evaluation] The central claim of up to 1.8x speedup is stated without any description of the exact baselines (e.g., which prior overlap methods or configurations), measurement methodology, error bars, or number of runs, preventing assessment of whether the gains are statistically robust or reproducible.
Authors: We agree with this observation. The manuscript's abstract and evaluation section do not provide sufficient details on the baselines used for the 1.8x speedup claim. We will revise both the abstract and the evaluation section to explicitly describe the baselines (including the standard Megatron-LM implementation with computation-communication overlap for all-to-all) and the measurement methodology (average throughput over training iterations), add error bars from repeated runs, and specify the number of runs performed. This will allow readers to better assess the robustness of the reported gains. revision: yes
-
Referee: [Method and Roofline Model] The roofline model is invoked to balance bandwidth allocation between attention and FFN groups, yet no equations, fitted parameters, or empirical validation against measured hardware counters are supplied, leaving open whether it accurately predicts the claimed perfect overlap or merely restates the observed wall-clock times.
Authors: The referee correctly points out the lack of detail in the roofline model. We will update the method section to include the specific equations used for the roofline analysis, the fitted parameters based on our hardware profiling, and empirical validation by comparing the model's predictions to actual hardware performance counters. This revision will demonstrate that the model is predictive and not merely descriptive of the results. revision: yes
-
Referee: [Evaluation and Scalability] Experiments are confined to 16 nodes; the weakest assumption, that disaggregation plus the uni-directional pipeline introduces no offsetting synchronization, load-imbalance, or pipeline-bubble overheads, is therefore untested at larger scales where attention-to-FFN ratios may fluctuate more severely across layers.
Authors: We partially agree. While our experiments are limited to 16 nodes, the disaggregated design and uni-directional pipeline are intended to mitigate synchronization and bubble issues through bandwidth balancing. We will revise the evaluation section to include a more thorough discussion of these assumptions, supported by analytical models of pipeline overheads, and explicitly state the limitation regarding larger scales. However, we maintain that the core benefits are demonstrated at the evaluated scale, and the approach is designed to be scalable. revision: partial
- Not addressed in this revision: empirical evaluation at cluster scales significantly larger than 16 nodes, which is beyond our current resource availability.
Circularity Check
No circularity: claims rest on measured wall-clock speedups from the implemented system, not on derivations or fitted predictions.
full rationale
The paper describes a disaggregated MoE training architecture (separate attention/FFN GPU groups, uni-directional many-to-many pipeline, roofline-based bandwidth allocation) and reports empirical speedups up to 1.8x on 16-node clusters. No equations, fitted parameters, or first-principles derivations are present that could reduce to self-definition or self-citation chains. The roofline model functions as an engineering heuristic for runtime allocation rather than a predictive result whose validity loops back to the paper's own inputs. All load-bearing claims are externally falsifiable via hardware measurements independent of any internal fit or prior self-citation.
Axiom & Free-Parameter Ledger
free parameters (1)
- attention-to-FFN GPU allocation ratio
axioms (1)
- domain assumption: All-to-all communication is the primary limiter in expert parallelism, and prior overlap methods leave residual stalls due to layer imbalance.
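A rough back-of-the-envelope check of that domain assumption, using placeholder numbers that are not taken from the paper or from H800 measurements:

```python
# Rough estimate of per-layer all-to-all time vs. FFN compute time for one
# MoE layer. All parameters are hypothetical placeholders, not measurements.
tokens     = 8192           # tokens per GPU per microbatch
hidden     = 4096           # model hidden size
topk       = 2              # experts activated per token
ffn_mult   = 4              # expert FFN expansion factor
bytes_elem = 2              # bf16
net_bytes_per_s = 50e9 / 8  # usable inter-node bandwidth per GPU (~50 Gbit/s)
gpu_flops  = 300e12         # sustained bf16 throughput per GPU (~300 TFLOP/s)

# Dispatch + combine move each routed token's activation twice across nodes.
a2a_bytes = 2 * tokens * topk * hidden * bytes_elem
a2a_s     = a2a_bytes / net_bytes_per_s

# Expert FFN: two matmuls per routed token, 2*hidden*(ffn_mult*hidden) FLOPs each.
ffn_flops = tokens * topk * 2 * (2 * hidden * ffn_mult * hidden)
ffn_s     = ffn_flops / gpu_flops

print(f"all-to-all ~{a2a_s * 1e3:.1f} ms vs FFN compute ~{ffn_s * 1e3:.1f} ms")
# With these placeholder numbers the network transfer is on the same order as
# (or larger than) the FFN compute, which is the imbalance the axiom asserts.
```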
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
DisagMoE separates attention and FFN layers into disjoint GPU groups, introduces a multi-stage pipeline with uni-directional, many-to-many communications, and employs a computation-communication roofline model to balance GPU and network bandwidth allocation among the attention and FFN groups.
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
AF-Pipe adopts the AF disaggregated architecture... treats them as a first-class stage alongside attention and FFN compute, aligning stage boundaries across groups to systematically overlap communication with both computations.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
gpt-oss-120b & gpt-oss-20b Model Card
Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Jacob Austin, Sholto Douglas, Roy Frostig, Anselm Levskaya, Charlie Chen, Sharad Vikram, Federico Lebron, Peter Choy, Vinay Ramasesh, Albert Webson, and Reiner Pope. How to scale your model. 2025. Retrieved from https://jax-ml.github.io/scaling-book/
work page 2025
-
[3]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023
work page 2023
-
[4]
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022
work page 2022
-
[6]
DeepSeek-V3.2: Efficient Reasoning & Agentic AI
DeepSeek-AI. DeepSeek-V3.2: Efficient Reasoning & Agentic AI. https://huggingface.co/deepseek-ai/DeepSeek-V3.2, 2025. Accessed: 2026-05-02
work page 2025
-
[7]
DeepSeek-AI. Deepseek-v3 technical report, 2025. URL https://arxiv.org/abs/2412.19437
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
DeepSeek-AI. DeepSeek-V4-Pro. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro,
-
[9]
Accessed: 2026-05-02
work page 2026
-
[10]
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022
work page 2022
-
[11]
Rdma over ethernet for distributed training at meta scale
Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, et al. Rdma over ethernet for distributed training at meta scale. In Proceedings of the ACM SIGCOMM 2024 Conference, pages 57–70, 2024
work page 2024
-
[12]
Gurobi Optimizer Reference Manual, 2026
Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2026. URL https://www.gurobi.com
work page 2026
-
[13]
Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models
Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 120–134, 2022
work page 2022
-
[14]
Tutel: Adaptive mixture-of-experts at scale
Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, Joe Chau, Peng Cheng, Fan Yang, Mao Yang, and Yongqiang Xiong. Tutel: Adaptive mixture-of-experts at scale. CoRR, abs/2206.03382, June 2022. URL https://arxiv.org/pdf/2206.03382.pdf
-
[15]
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. Lancet: Accelerating mixture-of-experts training via whole graph computation-communication overlapping. Proceedings of Machine Learning and Systems, 6:74–86, 2024
work page 2024
-
[17]
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[18]
Efficient large-scale language model training on gpu clusters using megatron-lm
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the international conference for high performance computing, networking,...
work page 2021
-
[19]
NVIDIA. Gdrcopy. https://github.com/NVIDIA/gdrcopy, 2025. Accessed: 2025-10-27
work page 2025
- [20]
-
[21]
Qwen Team. Qwen3.6-Max-Preview: Smarter, Sharper, Still Evolving. https://qwen.ai/blog?id=qwen3.6-max-preview, April 2026. Accessed: 2026-05-02
work page 2026
-
[22]
Zero: Memory optimizations toward training trillion parameter models
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020
work page 2020
-
[23]
Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In International conference on machine learning, pages 18332–18346. PMLR, 2022
work page 2022
-
[24]
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020
work page 2020
-
[25]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[26]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[27]
ByteDance Seed Team. Seed2.0 model card: Towards intelligence frontier for real-world complexity, February 2026. Model Card
work page 2026
-
[28]
Qwen3.5: Accelerating productivity with native multimodal agents, February 2026
Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
work page 2026
-
[29]
Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, et al. Step-3 is large yet affordable: Model-system co-design for cost-effective decoding. arXiv preprint arXiv:2507.19427, 2025
-
[30]
Hetermoe: Efficient training of mixture-of-experts models on heterogeneous gpus
Yongji Wu, Xueshen Liu, Shuowei Jin, Ceyu Xu, Feng Qian, Z Morley Mao, Matthew Lentz, Danyang Zhuo, and Ion Stoica. Hetermoe: Efficient training of mixture-of-experts models on heterogeneous gpus. arXiv preprint arXiv:2504.03871, 2025
-
[31]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
Z.ai. GLM-5.1. https://huggingface.co/zai-org/GLM-5.1, 2026. Accessed: 2026-05-02
work page 2026
-
[33]
Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, et al. Comet: Fine-grained computation-communication overlapping for mixture-of-experts. arXiv preprint arXiv:2502.19811, 2025
-
[34]
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Megascale-infer: Serving mixture-of-experts at scale with disaggregated expert parallelism
Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, et al. Megascale-infer: Serving mixture-of-experts at scale with disaggregated expert parallelism. arXiv preprint arXiv:2504.02263, 2025