GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference
Pith reviewed 2026-05-18 11:58 UTC · model grok-4.3
The pith
GRACE-MoE combines expert grouping, dynamic replication, and locality-aware routing to cut distributed SMoE inference latency up to 4.66x without accuracy loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GRACE-MoE is a lossless co-optimization framework that integrates expert grouping to reduce communication and dynamic replication to correct load skew, together with locality-aware routing to resolve replica selection. To underpin this coordinated optimization in multi-node settings, GRACE-MoE adopts a hierarchical sparse communication design that reduces cross-node traffic while implicitly aligning execution across nodes, thereby mitigating synchronization overhead.
What carries the argument
The GRACE-MoE coordination of expert grouping, dynamic replication, and locality-aware routing, backed by hierarchical sparse communication that cuts cross-node traffic and aligns node execution.
If this is right
- End-to-end inference latency drops in multi-node multi-GPU clusters.
- Speedups reach 4.66 times versus prior distributed systems on varied models.
- The optimizations preserve model accuracy while improving hardware efficiency.
- The approach scales to different SMoE architectures without redesign.
Where Pith is reading between the lines
- Similar grouping and replication tactics could transfer to other communication-bound distributed training or serving workloads.
- The implicit alignment property may reduce the need for explicit barriers in future large-scale expert systems.
- Public release of the code would allow direct measurement of gains on new hardware generations.
Load-bearing premise
The hierarchical sparse communication design reduces cross-node traffic while implicitly aligning execution across nodes without introducing new synchronization bottlenecks or correctness issues.
What would settle it
A multi-node run that shows either higher synchronization overhead than single-node baselines or incorrect model outputs would falsify the claim that the hierarchical design works without side effects.
Figures
read the original abstract
Sparse Mixture of Experts (SMoE) enables scalable parameter growth in large language models (LLMs) by selectively activating a subset of experts, and its large parameter count necessitates distributed deployment for inference. However, distributed inference faces a critical dilemma: although communication overhead constitutes the primary bottleneck, reducing it often exacerbates computational load imbalance, leading to resource waste. In this paper, we present GRACE-MoE, which stands for Grouping and Replication with Locality-Aware Routing for SMoE inference. GRACE-MoE is a lossless co-optimization framework that integrates expert grouping to reduce communication and dynamic replication to correct load skew, together with locality-aware routing to resolve replica selection. To underpin this coordinated optimization in multi-node settings, GRACE-MoE adopts a hierarchical sparse communication design that reduces cross-node traffic while implicitly aligning execution across nodes, thereby mitigating synchronization overhead. Experiments on diverse models and multi-node, multi-GPU environments demonstrate that GRACE-MoE efficiently reduces end-to-end inference latency, achieving up to 4.66x speedup over existing systems, and the code will be released upon acceptance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents GRACE-MoE, a lossless co-optimization framework for distributed Sparse Mixture of Experts (SMoE) inference. It integrates expert grouping to reduce communication overhead, dynamic replication to correct load skew, locality-aware routing to resolve replica selection, and a hierarchical sparse communication design that reduces cross-node traffic while implicitly aligning execution across nodes to mitigate synchronization overhead. Experiments on diverse models and multi-node multi-GPU environments are reported to achieve up to 4.66x speedup over existing systems, with code to be released upon acceptance.
Significance. If the performance claims hold with detailed, reproducible validation, the work would be significant for practical distributed inference of large MoE models. It directly targets the communication-versus-load-imbalance dilemma through coordinated grouping, replication, and routing, potentially improving end-to-end latency in multi-node GPU clusters while preserving model correctness. The hierarchical sparse communication approach, if shown to avoid new bottlenecks, could inform future systems designs for scalable LLM serving.
major comments (2)
- Abstract: The abstract states experimental speedups including the 4.66x figure but provides no details on experimental setup, baselines, variance, or whether the figure holds after controlling for implementation differences; only abstract-level claims are available so central performance numbers cannot be verified.
- Hierarchical sparse communication design (described in the multi-node setting): The premise that this design reduces cross-node traffic while implicitly aligning execution across nodes without introducing new synchronization bottlenecks or correctness issues is invoked to support the multi-node co-optimization and speedup claim, yet no explicit latency breakdowns, barrier timings, or before/after comparisons for synchronization costs are provided.
minor comments (1)
- Abstract: The phrase 'lossless co-optimization' is used without a brief clarification of what 'lossless' precisely means in the context of inference (e.g., no accuracy degradation).
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address the two major comments point by point below. Where the comments identify opportunities for greater clarity or additional evidence, we agree to revise the manuscript accordingly.
read point-by-point responses
-
Referee: Abstract: The abstract states experimental speedups including the 4.66x figure but provides no details on experimental setup, baselines, variance, or whether the figure holds after controlling for implementation differences; only abstract-level claims are available so central performance numbers cannot be verified.
Authors: We agree that the abstract is currently high-level and would benefit from additional context to allow readers to better assess the performance claims. In the revised manuscript we will expand the abstract to briefly note the key experimental conditions (multi-node multi-GPU clusters, diverse MoE models, end-to-end latency measurement), the primary baselines (DeepSpeed-MoE and Megatron-LM variants), and that the reported speedups are averaged over multiple runs with standard deviation reported in the main experiments section. Detailed controls for implementation differences and variance are already presented in Section 5; the abstract revision will point readers to those results. revision: yes
-
Referee: Hierarchical sparse communication design (described in the multi-node setting): The premise that this design reduces cross-node traffic while implicitly aligning execution across nodes without introducing new synchronization bottlenecks or correctness issues is invoked to support the multi-node co-optimization and speedup claim, yet no explicit latency breakdowns, barrier timings, or before/after comparisons for synchronization costs are provided.
Authors: The hierarchical sparse communication design is intended to reduce cross-node traffic through locality-aware grouping and routing while achieving implicit alignment via the dynamic replication and routing decisions, as described in Section 4. We acknowledge that explicit breakdowns of synchronization overhead would strengthen the supporting evidence. In the revision we will add a new figure and accompanying text in Section 5.3 that provides (i) latency breakdowns separating communication, computation, and synchronization components across node counts, and (ii) before/after comparisons of barrier and all-reduce timings with and without the hierarchical design. These additions will directly address the concern about new bottlenecks. revision: yes
Circularity Check
No circularity: empirical systems design with experimental validation
full rationale
The paper presents GRACE-MoE as an engineering framework combining expert grouping, dynamic replication, locality-aware routing, and hierarchical sparse communication for distributed MoE inference. All performance claims (up to 4.66x speedup) are framed as outcomes of experiments on diverse models and multi-node multi-GPU setups rather than analytical derivations or first-principles predictions. No equations, fitted parameters, or self-referential theorems appear in the abstract or description; the design choices are justified by stated goals of reducing communication and correcting load skew, with results measured externally. This is a self-contained empirical systems contribution whose central claims rest on implementation and benchmarking, not on any reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 3 Pith papers
-
Hierarchical Mixture-of-Experts with Two-Stage Optimization
Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and v...
-
Federation of Experts: Communication Efficient Distributed Inference for Large Language Models
FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faste...
-
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding
EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
Reference graph
Works this paper leans on
-
[1]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,
work page 1901
-
[2]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
MoETuner: Optimized mixture of expert serving with balanced expert placement and token routing,
Seokjin Go and Divya Mahajan. Moetuner: Optimized mixture of expert serving with balanced expert placement and token routing.arXiv preprint arXiv:2502.06643,
-
[4]
Measuring mathematical problem solving with the math dataset.NeurIPS,
10 Under review as a conference paper at ICLR 2026 Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.NeurIPS,
work page 2026
-
[5]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[6]
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668,
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[7]
Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. Accelerating distributed{MoE} training and inference with lina. In2023 USENIX Annual Technical Conference (USENIX ATC 23), pp. 945–959, 2023a. Pingzhi Li, Zhenyu Zhang, Prateek Yadav, Yi-Lin Sung, Yu Cheng, Mohit Bansal, and Tianlong Chen. Merge, then compress: Demystify efficient smoe with hints f...
-
[8]
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture- of-experts language model.arXiv preprint arXiv:2405.04434,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Janus: A unified distributed training framework for sparse mixture-of-experts models
Juncai Liu, Jessie Hui Wang, and Yimin Jiang. Janus: A unified distributed training framework for sparse mixture-of-experts models. InProceedings of the ACM SIGCOMM 2023 Conference, pp. 486–498,
work page 2023
-
[10]
OLMoE: Open Mixture-of-Experts Language Models
Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Wei- jia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, et al. Olmoe: Open mixture-of-experts language models.arXiv preprint arXiv:2409.02060,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
11 Under review as a conference paper at ICLR 2026 Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System opti- mizations enable training deep learning models with over 100 billion parameters. InProceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 3505–3506,
work page 2026
-
[12]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Accelerating mixture-of-experts training with adaptive expert replication
Athinagoras Skiadopoulos, Mark Zhao, Swapnil Gandhi, Thomas Norrie, Shrijeet Mukherjee, and Christos Kozyrakis. Accelerating mixture-of-experts training with adaptive expert replication. arXiv preprint arXiv:2504.19925,
-
[14]
Lazarus: Resilient and elastic training of mixture-of-experts models with adaptive expert placement,
Yongji Wu, Wenjie Qu, Tianyang Tao, Zhuang Wang, Wei Bai, Zhuohao Li, Yuan Tian, Jiaheng Zhang, Matthew Lentz, and Danyang Zhuo. Lazarus: Resilient and elastic training of mixture-of- experts models with adaptive expert placement.arXiv preprint arXiv:2407.04656,
-
[15]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
12 Under review as a conference paper at ICLR 2026 Mohan Zhang, Pingzhi Li, Jie Peng, Mufan Qiu, and Tianlong Chen. Advancing MoE efficiency: A collaboration-constrained routing (C2R) strategy for better expert parallelism design. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.),Proceedings of the 2025 Conference of the Nations of the Americas Chapter of...
work page 2026
-
[17]
In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V
Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025. naacl-long.347. URLhttps://aclanthology.org/2025.naacl-long.347/. Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured la...
-
[18]
A APPENDIX A.1 LLM USAGESTATEMENT In preparing this paper, a large language model (LLM) was used solely for grammar checking and language polishing. The design and implementation of all algorithms and experiments, as well as the analysis of results and conclusions, were independently conducted and written by the authors. A.2 ALGORITHM FORCONTROLLEDNON-UNI...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.