MoETuner: Optimized mixture of expert serving with balanced expert placement and token routing
11 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: dataset (1)
Citation polarities: use dataset (1)
Citing papers
- NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding
  NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.
- Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory
  DODOCO measurements show MoE routing imbalance is intrinsic to architecture and real text, not correctable by EP scaling or represented by mock tokens, forming two persistent Gini bands.
- GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems
  GEM is a GPU-variability-aware expert-to-GPU mapping framework for MoE inference that classifies experts as consistent or temporal and places them to equalize finish times across heterogeneous GPUs.
- Hierarchical Mixture-of-Experts with Two-Stage Optimization
  Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.
- Federation of Experts: Communication Efficient Distributed Inference for Large Language Models
  FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faster inference with comparable quality.
- SpaceMoE: Realizing Distributed Mixture-of-Experts Inference over Space Networks
  SpaceMoE partitions MoE layers across orbiting satellite subnets in a ring and optimizes expert placement by activation probability and path latency, yielding at least 3x lower inference latency in thousand-satellite simulations versus random baselines.
- Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
  Profiling shows persistent expert load imbalance and domain-specific activation patterns in large MoE models; workload-aware grouping and placement reduce all-to-all communication volume by up to 20x.
- Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference
  Comprehensive profiling of expert selection in frontier MoE models reveals temporal and spatial patterns that enable 6.6x speedup on wafer-scale GPUs and 1.25x on existing systems via targeted optimizations.
- GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference
  GRACE-MoE integrates expert grouping, dynamic replication, and locality-aware routing with hierarchical sparse communication to reduce end-to-end latency in distributed SMoE inference.
- Replication in Graph Partitioning and Scheduling Problems
  Replication reduces costs by 17-65% on average in hypergraph partitioning and 11-23% in DAG scheduling, sometimes eliminating communication needs entirely.
- Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement
  Prism optimizes expert placement and uses runtime migration for distributed MoE inference on heterogeneous edge GPUs, achieving up to 30.6% lower latency than baselines.