MoETuner: Optimized mixture of expert serving with balanced expert placement and token routing

2025 · arXiv 2502.06643

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding

cs.DC · 2026-05-20 · unverdicted · novelty 6.0

NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.

Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory

cs.DC · 2026-05-20 · unverdicted · novelty 6.0

DODOCO measurements show MoE routing imbalance is intrinsic to architecture and real text, not correctable by EP scaling or represented by mock tokens, forming two persistent Gini bands.

GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems

cs.DC · 2026-05-19 · unverdicted · novelty 6.0

GEM is a GPU-variability-aware expert-to-GPU mapping framework for MoE inference that classifies experts as consistent or temporal and places them to equalize finish times across heterogeneous GPUs.

Hierarchical Mixture-of-Experts with Two-Stage Optimization

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.

Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faster inference with comparable quality.

SpaceMoE: Realizing Distributed Mixture-of-Experts Inference over Space Networks

cs.DC · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

SpaceMoE partitions MoE layers across orbiting satellite subnets in a ring and optimizes expert placement by activation probability and path latency, yielding at least 3x lower inference latency in thousand-satellite simulations versus random baselines.

Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

Profiling shows persistent expert load imbalance and domain-specific activation patterns in large MoE models; workload-aware grouping and placement reduce all-to-all communication volume by up to 20x.

Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference

cs.DC · 2025-10-07 · conditional · novelty 6.0

Comprehensive profiling of expert selection in frontier MoE models reveals temporal and spatial patterns that enable 6.6x speedup on wafer-scale GPUs and 1.25x on existing systems via targeted optimizations.

GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference

cs.DC · 2025-09-29 · unverdicted · novelty 6.0

GRACE-MoE integrates expert grouping, dynamic replication, and locality-aware routing with hierarchical sparse communication to reduce end-to-end latency in distributed SMoE inference.

Replication in Graph Partitioning and Scheduling Problems

cs.DC · 2026-04-30 · unverdicted · novelty 5.0

Replication reduces costs by 17-65% on average in hypergraph partitioning and 11-23% in DAG scheduling, sometimes eliminating communication needs entirely.

Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement

cs.DC · 2025-08-18 · unverdicted · novelty 4.0

Prism optimizes expert placement and uses runtime migration for distributed MoE inference on heterogeneous edge GPUs, achieving up to 30.6% lower latency than baselines.

citing papers explorer

NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding

cs.DC · 2026-05-20 · unverdicted · none · ref 21

Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory

cs.DC · 2026-05-20 · unverdicted · none · ref 27

GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems

cs.DC · 2026-05-19 · unverdicted · none · ref 23

Hierarchical Mixture-of-Experts with Two-Stage Optimization

cs.LG · 2026-05-08 · unverdicted · none · ref 11

Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

cs.LG · 2026-05-07 · unverdicted · none · ref 8

SpaceMoE: Realizing Distributed Mixture-of-Experts Inference over Space Networks

cs.DC · 2026-05-01 · unverdicted · none · ref 14 · 2 links

Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns

cs.LG · 2026-04-25 · unverdicted · none · ref 8

Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference

cs.DC · 2025-10-07 · conditional · none · ref 18

GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference

cs.DC · 2025-09-29 · unverdicted · none · ref 3

Replication in Graph Partitioning and Scheduling Problems

cs.DC · 2026-04-30 · unverdicted · none · ref 20

Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement

cs.DC · 2025-08-18 · unverdicted · none · ref 8
