
arxiv: 2509.25041 · v4 · submitted 2025-09-29 · 💻 cs.DC

GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference

Pith reviewed 2026-05-18 11:58 UTC · model grok-4.3

classification 💻 cs.DC
keywords Sparse Mixture of Experts · Distributed Inference · Expert Grouping · Dynamic Replication · Locality-Aware Routing · Communication Optimization · Load Balancing · LLM Serving

The pith

GRACE-MoE combines expert grouping, dynamic replication, and locality-aware routing to cut distributed SMoE inference latency up to 4.66x without accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GRACE-MoE as a lossless framework for distributed inference of sparse mixture-of-experts models in large language models. It resolves the core tension that cutting communication often worsens load imbalance by grouping experts to shrink traffic, replicating them on the fly to balance compute, and routing queries to nearby replicas. A hierarchical sparse communication layer further lowers cross-node exchanges while keeping nodes loosely synchronized. Tests across models and multi-node GPU clusters show the combined changes deliver large end-to-end speedups.
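The grouping step can be illustrated with a minimal sketch. It assumes affinity is a simple co-activation count (how often two experts are selected for the same token) and uses a greedy equal-size placement; the function names `build_affinity`, `greedy_group`, and `cross_group_pairs` are hypothetical, and the paper's hierarchical grouping may differ in detail.

```python
from itertools import combinations

def build_affinity(traces, num_experts):
    """Count how often each pair of experts is selected for the same token."""
    affinity = [[0] * num_experts for _ in range(num_experts)]
    for selected in traces:
        for a, b in combinations(sorted(set(selected)), 2):
            affinity[a][b] += 1
            affinity[b][a] += 1
    return affinity

def greedy_group(affinity, num_groups):
    """Assumed heuristic: seed each group with the lowest-numbered unplaced
    expert, then repeatedly add the unplaced expert with the highest total
    affinity to the group until the group is full."""
    n = len(affinity)
    size = n // num_groups
    unplaced = set(range(n))
    groups = []
    for _ in range(num_groups):
        seed = min(unplaced)
        group = [seed]
        unplaced.remove(seed)
        while len(group) < size and unplaced:
            best = max(sorted(unplaced),
                       key=lambda e: sum(affinity[e][g] for g in group))
            group.append(best)
            unplaced.remove(best)
        groups.append(group)
    # Spread any leftover experts when n is not divisible by num_groups.
    for i, e in enumerate(sorted(unplaced)):
        groups[i % num_groups].append(e)
    return groups

def cross_group_pairs(traces, groups):
    """Proxy for communication: co-selected expert pairs that span two groups."""
    owner = {e: gi for gi, g in enumerate(groups) for e in g}
    return sum(1 for sel in traces
               for a, b in combinations(sorted(set(sel)), 2)
               if owner[a] != owner[b])
```

On a toy trace where experts 0 and 2 (and 1 and 3) usually fire together, the greedy placement keeps those pairs on one device, so fewer co-selected pairs cross a group boundary than with a naive contiguous split.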

Core claim

GRACE-MoE is a lossless co-optimization framework that integrates expert grouping to reduce communication and dynamic replication to correct load skew, together with locality-aware routing to resolve replica selection. To underpin this coordinated optimization in multi-node settings, GRACE-MoE adopts a hierarchical sparse communication design that reduces cross-node traffic while implicitly aligning execution across nodes, thereby mitigating synchronization overhead.
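To make the load-skew correction concrete, here is a hedged sketch of dynamic replication. It assumes that tokens for a replicated expert split evenly across its copies and that imbalance is measured as max device load over mean device load; `replicate_hot_experts`, `max_replicas`, and `target` are illustrative names and thresholds, not taken from the paper.

```python
def imbalance(device_load):
    """Max-over-mean load ratio; 1.0 means perfectly balanced."""
    mean = sum(device_load.values()) / len(device_load)
    return max(device_load.values()) / mean if mean else 1.0

def replicate_hot_experts(expert_load, placement, num_devices,
                          max_replicas, target=1.1):
    """Copy the hottest expert from the busiest device to the idlest device
    until the imbalance target is met or the replica budget runs out.

    expert_load: expert id -> token count
    placement:   expert id -> list of devices hosting a copy
    """
    placement = {e: list(devs) for e, devs in placement.items()}

    def loads():
        load = {d: 0.0 for d in range(num_devices)}
        for e, devs in placement.items():
            for d in devs:
                # Assumption: load is split evenly across replicas.
                load[d] += expert_load[e] / len(devs)
        return load

    added = 0
    while added < max_replicas:
        load = loads()
        if imbalance(load) <= target:
            break
        hot_dev = max(load, key=load.get)
        cold_dev = min(load, key=load.get)
        candidates = [e for e, devs in placement.items()
                      if hot_dev in devs and cold_dev not in devs]
        if not candidates:
            break
        hottest = max(candidates,
                      key=lambda e: expert_load[e] / len(placement[e]))
        placement[hottest].append(cold_dev)
        added += 1
    return placement, loads()
```

The sketch replicates only the single hottest expert per step, which mirrors the observation in Figure 3 that overload comes from a few frequently activated experts rather than from whole groups.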

What carries the argument

The GRACE-MoE coordination of expert grouping, dynamic replication, and locality-aware routing, backed by hierarchical sparse communication that cuts cross-node traffic and aligns node execution.
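Replica selection can be sketched as follows: prefer a replica on the token's own device, otherwise rotate across remote replicas with smooth weighted round-robin. The class `ReplicaRouter` and its integer weights are hypothetical; the weights stand in for whatever predicted-load signal the paper uses, which this page does not describe.

```python
class ReplicaRouter:
    """Locality-first replica choice with weighted round-robin fallback."""

    def __init__(self, replicas, weights):
        # replicas: expert id -> list of device ids hosting a copy
        # weights:  device id -> positive weight (assumed capacity score)
        self.replicas = replicas
        self.weights = weights
        self.credit = {}  # (expert, device) -> running round-robin credit

    def pick(self, expert, source_device):
        devices = self.replicas[expert]
        # Locality preference: a local replica needs no communication.
        if source_device in devices:
            return source_device
        # Smooth weighted round-robin over the remote replicas.
        total = sum(self.weights[d] for d in devices)
        best = None
        for d in devices:
            key = (expert, d)
            self.credit[key] = self.credit.get(key, 0) + self.weights[d]
            if best is None or self.credit[key] > self.credit[(expert, best)]:
                best = d
        self.credit[(expert, best)] -= total
        return best
```

With weights 2 and 1 on two remote replicas, three consecutive remote requests go to the heavier-weighted device twice and the other once, while a request that originates on a hosting device never leaves it.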

If this is right

  • End-to-end inference latency drops in multi-node multi-GPU clusters.
  • Speedups reach 4.66 times versus prior distributed systems on varied models.
  • The optimizations preserve model accuracy while improving hardware efficiency.
  • The approach scales to different SMoE architectures without redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar grouping and replication tactics could transfer to other communication-bound distributed training or serving workloads.
  • The implicit alignment property may reduce the need for explicit barriers in future large-scale expert systems.
  • Public release of the code would allow direct measurement of gains on new hardware generations.

Load-bearing premise

The hierarchical sparse communication design reduces cross-node traffic while implicitly aligning execution across nodes without introducing new synchronization bottlenecks or correctness issues.
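One plausible reading of how a hierarchical design could cut cross-node traffic is node-level aggregation: a token is sent at most once to each remote node and fanned out to that node's GPUs locally. The sketch below only counts cross-node copies under that assumption; `cross_node_copies` is a hypothetical helper, and the paper's actual communication scheme is not specified on this page.

```python
def cross_node_copies(assignments, device_node, hierarchical):
    """Count token copies that cross a node boundary.

    assignments: list of (source_device, [target_devices]) per token
    device_node: device id -> node id
    hierarchical: if True, assume one copy per remote node (then local
                  fan-out); if False, one copy per remote target device.
    """
    copies = 0
    for src, targets in assignments:
        src_node = device_node[src]
        remote = [d for d in targets if device_node[d] != src_node]
        if hierarchical:
            copies += len({device_node[d] for d in remote})
        else:
            copies += len(remote)
    return copies
```

For a 2 node × 2 GPU layout, a token routed to two experts on the same remote node costs two cross-node copies in the flat scheme and one in the aggregated scheme. Whether the saved traffic outweighs the extra intra-node hop is exactly the kind of measurement the premise depends on.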

What would settle it

A multi-node run that shows either higher synchronization overhead than single-node baselines or incorrect model outputs would falsify the claim that the hierarchical design works without side effects.

Figures

Figures reproduced from arXiv: 2509.25041 by Hanqi Zhu, Jie Peng, Lehan Pan, Wuyang Zhang, Yanyong Zhang, Yu Han, Ziyang Tao.

Figure 1
Grouping strictness and replication strategies. Experiments on OLMoE with 2 nodes × 2 GPUs/node, metrics reported in tokens. (a) Relaxing grouping strictness reduces communication compared to Vanilla. (b) Replicating highly activated experts alleviates load imbalance more effectively than replicating widely collaborative experts, relative to Hierarchical Grouping (HG). CoServe (Suo et al., 2025) and many …
Figure 2
Overview of GRACE-MoE. (a) Profiling expert selections to build affinity matrices. (b) Grouping high-affinity experts on the same device and dynamically replicating hot experts to balance computational load. (c) Adaptive routing reduces communication by prioritizing local replicas and balances requests via weighted round-robin with load prediction across remote replicas.
Figure 3
Computational load distribution after hierarchical grouping. (a) Sampled layers show that affinity clustering concentrates load only on a few groups. (b) In Layer 5, per-expert loads in the heaviest group reveal that overload comes from a few frequently activated experts, not all.
Figure 4
End-to-end inference latency and MoE layer time. Evaluation of GRACE-MoE and all baselines across three models with batch size = 128, prefill length = 64, and decode length = 16. Baselines and Evaluation Metrics. Baselines: We compare against widely used SMoE inference systems, including DeepSpeed (Rasley et al., 2020), Tutel (Hwang et al., 2023), Megablocks (Gale et al., 2023), vLLM (Kwon et al., 2023), a…
Figure 5
Component analysis. Grouping, replication and routing schemes are compared across three models under a 2 node × 2 GPUs/node setup on the WikiText dataset. Abbreviations: Vanilla (Average Grouping), HG (Hierarchical Grouping), FR (Fixed-Count Replication), DR (Dynamic-Count Replication), WRR (Weighted Round-Robin with Load Prediction), TAR (Topology-Aware Routing with Locality Preference).
Figure 6
Generalization across datasets. Expert affinity matrices from WikiText-2-v1, MATH, GitHub, and their mixture are cross-validated across models; communication overhead, measured in tokens, shows that GRACE-MoE sustains strong performance even under distribution shifts. … aware routing with locality preference. Compared with pure weighted round-robin, this strategy reduces All-to-All time by 2.8%, 8.3%, and 10.0…
Original abstract

Sparse Mixture of Experts (SMoE) enables scalable parameter growth in large language models (LLMs) by selectively activating a subset of experts, and its large parameter count necessitates distributed deployment for inference. However, distributed inference faces a critical dilemma: although communication overhead constitutes the primary bottleneck, reducing it often exacerbates computational load imbalance, leading to resource waste. In this paper, we present GRACE-MoE, which stands for Grouping and Replication with Locality-Aware Routing for SMoE inference. GRACE-MoE is a lossless co-optimization framework that integrates expert grouping to reduce communication and dynamic replication to correct load skew, together with locality-aware routing to resolve replica selection. To underpin this coordinated optimization in multi-node settings, GRACE-MoE adopts a hierarchical sparse communication design that reduces cross-node traffic while implicitly aligning execution across nodes, thereby mitigating synchronization overhead. Experiments on diverse models and multi-node, multi-GPU environments demonstrate that GRACE-MoE efficiently reduces end-to-end inference latency, achieving up to 4.66x speedup over existing systems, and the code will be released upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents GRACE-MoE, a lossless co-optimization framework for distributed Sparse Mixture of Experts (SMoE) inference. It integrates expert grouping to reduce communication overhead, dynamic replication to correct load skew, locality-aware routing to resolve replica selection, and a hierarchical sparse communication design that reduces cross-node traffic while implicitly aligning execution across nodes to mitigate synchronization overhead. Experiments on diverse models and multi-node multi-GPU environments are reported to achieve up to 4.66x speedup over existing systems, with code to be released upon acceptance.

Significance. If the performance claims hold with detailed, reproducible validation, the work would be significant for practical distributed inference of large MoE models. It directly targets the communication-versus-load-imbalance dilemma through coordinated grouping, replication, and routing, potentially improving end-to-end latency in multi-node GPU clusters while preserving model correctness. The hierarchical sparse communication approach, if shown to avoid new bottlenecks, could inform future systems designs for scalable LLM serving.

major comments (2)
  1. Abstract: The abstract states experimental speedups including the 4.66x figure but provides no details on experimental setup, baselines, variance, or whether the figure holds after controlling for implementation differences; only abstract-level claims are available so central performance numbers cannot be verified.
  2. Hierarchical sparse communication design (described in the multi-node setting): The premise that this design reduces cross-node traffic while implicitly aligning execution across nodes without introducing new synchronization bottlenecks or correctness issues is invoked to support the multi-node co-optimization and speedup claim, yet no explicit latency breakdowns, barrier timings, or before/after comparisons for synchronization costs are provided.
minor comments (1)
  1. Abstract: The phrase 'lossless co-optimization' is used without a brief clarification of what 'lossless' precisely means in the context of inference (e.g., no accuracy degradation).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address the two major comments point by point below. Where the comments identify opportunities for greater clarity or additional evidence, we agree to revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Abstract: The abstract states experimental speedups including the 4.66x figure but provides no details on experimental setup, baselines, variance, or whether the figure holds after controlling for implementation differences; only abstract-level claims are available so central performance numbers cannot be verified.

    Authors: We agree that the abstract is currently high-level and would benefit from additional context to allow readers to better assess the performance claims. In the revised manuscript we will expand the abstract to briefly note the key experimental conditions (multi-node multi-GPU clusters, diverse MoE models, end-to-end latency measurement), the primary baselines (DeepSpeed, Tutel, Megablocks, and vLLM, among others), and that the reported speedups are averaged over multiple runs with standard deviation reported in the main experiments section. Detailed controls for implementation differences and variance are already presented in Section 5; the abstract revision will point readers to those results. revision: yes

  2. Referee: Hierarchical sparse communication design (described in the multi-node setting): The premise that this design reduces cross-node traffic while implicitly aligning execution across nodes without introducing new synchronization bottlenecks or correctness issues is invoked to support the multi-node co-optimization and speedup claim, yet no explicit latency breakdowns, barrier timings, or before/after comparisons for synchronization costs are provided.

    Authors: The hierarchical sparse communication design is intended to reduce cross-node traffic through locality-aware grouping and routing while achieving implicit alignment via the dynamic replication and routing decisions, as described in Section 4. We acknowledge that explicit breakdowns of synchronization overhead would strengthen the supporting evidence. In the revision we will add a new figure and accompanying text in Section 5.3 that provides (i) latency breakdowns separating communication, computation, and synchronization components across node counts, and (ii) before/after comparisons of barrier and all-reduce timings with and without the hierarchical design. These additions will directly address the concern about new bottlenecks. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems design with experimental validation

Full rationale

The paper presents GRACE-MoE as an engineering framework combining expert grouping, dynamic replication, locality-aware routing, and hierarchical sparse communication for distributed MoE inference. All performance claims (up to 4.66x speedup) are framed as outcomes of experiments on diverse models and multi-node multi-GPU setups rather than analytical derivations or first-principles predictions. No equations, fitted parameters, or self-referential theorems appear in the abstract or description; the design choices are justified by stated goals of reducing communication and correcting load skew, with results measured externally. This is a self-contained empirical systems contribution whose central claims rest on implementation and benchmarking, not on any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly assumes that expert activation patterns are sufficiently stable to benefit from static grouping and that replication decisions can be made without prohibitive overhead.
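The stability assumption can be probed with a simple check: compare the most frequent co-activated expert pairs from two profiling traces. The sketch below uses a Jaccard overlap of the top-k pairs; `top_pairs` and `pair_stability` are hypothetical helpers, and this metric is an illustration rather than the cross-validation reported in Figure 6.

```python
from collections import Counter
from itertools import combinations

def top_pairs(traces, k):
    """Return the k most frequently co-selected expert pairs."""
    counts = Counter()
    for selected in traces:
        counts.update(combinations(sorted(set(selected)), 2))
    # Sort by descending count, then by pair, so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {pair for pair, _ in ranked[:k]}

def pair_stability(traces_a, traces_b, k):
    """Jaccard overlap of the top-k pairs from two profiling traces."""
    a, b = top_pairs(traces_a, k), top_pairs(traces_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

A score near 1.0 suggests that a grouping profiled on one dataset should carry over to the other; a low score suggests the grouping may need re-profiling when the workload shifts.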



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Hierarchical Mixture-of-Experts with Two-Stage Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and v...

  2. Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faste...

  3. Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 6.0

    EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 3 Pith papers · 7 internal anchors

  1. [1]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901,

  2. [2]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,

  3. [3]

    MoETuner: Optimized mixture of expert serving with balanced expert placement and token routing

    Seokjin Go and Divya Mahajan. MoETuner: Optimized mixture of expert serving with balanced expert placement and token routing. arXiv preprint arXiv:2502.06643,

  4. [4]

    Measuring mathematical problem solving with the MATH dataset

    Under review as a conference paper at ICLR 2026. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. NeurIPS,

  5. [5]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

  6. [6]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668,

  7. [7]

    Accelerating distributed MoE training and inference with Lina

    Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. Accelerating distributed MoE training and inference with Lina. In 2023 USENIX Annual Technical Conference (USENIX ATC 23), pp. 945–959, 2023a. Pingzhi Li, Zhenyu Zhang, Prateek Yadav, Yi-Lin Sung, Yu Cheng, Mohit Bansal, and Tianlong Chen. Merge, then compress: Demystify efficient SMoE with hints f...

  8. [8]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434,

  9. [9]

    Janus: A unified distributed training framework for sparse mixture-of-experts models

    Juncai Liu, Jessie Hui Wang, and Yimin Jiang. Janus: A unified distributed training framework for sparse mixture-of-experts models. In Proceedings of the ACM SIGCOMM 2023 Conference, pp. 486–498,

  10. [10]

    OLMoE: Open Mixture-of-Experts Language Models

    Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, et al. OLMoE: Open mixture-of-experts language models. arXiv preprint arXiv:2409.02060,

  11. [11]

    DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505–3506,

  12. [12]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,

  13. [13]

    Accelerating mixture-of-experts training with adaptive expert replication

    Athinagoras Skiadopoulos, Mark Zhao, Swapnil Gandhi, Thomas Norrie, Shrijeet Mukherjee, and Christos Kozyrakis. Accelerating mixture-of-experts training with adaptive expert replication. arXiv preprint arXiv:2504.19925,

  14. [14]

    Lazarus: Resilient and elastic training of mixture-of-experts models with adaptive expert placement

    Yongji Wu, Wenjie Qu, Tianyang Tao, Zhuang Wang, Wei Bai, Zhuohao Li, Yuan Tian, Jiaheng Zhang, Matthew Lentz, and Danyang Zhuo. Lazarus: Resilient and elastic training of mixture-of-experts models with adaptive expert placement. arXiv preprint arXiv:2407.04656,

  15. [15]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

  16. [16]

    Advancing MoE efficiency: A collaboration-constrained routing (C2R) strategy for better expert parallelism design

    Mohan Zhang, Pingzhi Li, Jie Peng, Mufan Qiu, and Tianlong Chen. Advancing MoE efficiency: A collaboration-constrained routing (C2R) strategy for better expert parallelism design. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of...

  17. [17]

    SGLang: Efficient execution of structured la...

    Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.naacl-long.347. URL https://aclanthology.org/2025.naacl-long.347/. Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. SGLang: Efficient execution of structured la...

  18. [18]

    The design and implementation of all algorithms and experiments, as well as the analysis of results and conclusions, were independently conducted and written by the authors

    A APPENDIX A.1 LLM USAGE STATEMENT In preparing this paper, a large language model (LLM) was used solely for grammar checking and language polishing. The design and implementation of all algorithms and experiments, as well as the analysis of results and conclusions, were independently conducted and written by the authors. A.2 ALGORITHM FOR CONTROLLED NON-UNI...