MoETuner: Optimized mixture of expert serving with balanced expert placement and token routing

Seokjin Go, Divya Mahajan · 2025 · arXiv 2502.06643

15 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 7.0 · 2 refs

ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.

ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving

cs.DC · 2026-05-30 · unverdicted · novelty 7.0

ViBE co-optimizes expert placement with measured GPU performance variability in MoE inference to cut execution-time imbalance, delivering 14% better SLO attainment and up to 45% lower P90 TTFT.

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

Task-aware expert grouping derived from family-specific co-activation traces cuts average communication cost 31.39% versus task-agnostic baselines in multi-task MoE inference while maintaining Jain fairness near 1.0.

Birkhoff Decompositions and Photonic Interconnects Wait! Don't Forget the Compute!

cs.NI · 2026-05-26 · unverdicted · novelty 6.0

A greedy max-weight decomposition strategy for MoE all-to-all communication on photonic fabrics bounds the number of matchings, which improves overlap efficiency and reduces compute overhead compared to Birkhoff-von Neumann (BvN) decomposition.

NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding

cs.DC · 2026-05-20 · unverdicted · novelty 6.0

NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.

Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory

cs.DC · 2026-05-20 · unverdicted · novelty 6.0

DODOCO measurements show that MoE routing imbalance is intrinsic to the architecture and to real text: it is not corrected by EP scaling, is not reproduced by mock tokens, and forms two persistent Gini bands.

GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems

cs.DC · 2026-05-19 · unverdicted · novelty 6.0

GEM is a GPU-variability-aware expert-to-GPU mapping framework for MoE inference that classifies experts as consistent or temporal and places them to equalize finish times across heterogeneous GPUs.

Hierarchical Mixture-of-Experts with Two-Stage Optimization

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.

Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faster inference with comparable quality.

SpaceMoE: Realizing Distributed Mixture-of-Experts Inference over Space Networks

cs.DC · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

SpaceMoE partitions MoE layers across orbiting satellite subnets in a ring and optimizes expert placement by activation probability and path latency, yielding at least 3x lower inference latency in thousand-satellite simulations versus random baselines.

Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

Profiling shows persistent expert load imbalance and domain-specific activation patterns in large MoE models; workload-aware grouping and placement reduce all-to-all communication volume by up to 20x.

Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference

cs.DC · 2025-10-07 · conditional · novelty 6.0

Comprehensive profiling of expert selection in frontier MoE models reveals temporal and spatial patterns that enable 6.6x speedup on wafer-scale GPUs and 1.25x on existing systems via targeted optimizations.

GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference

cs.DC · 2025-09-29 · unverdicted · novelty 6.0

GRACE-MoE integrates expert grouping, dynamic replication, and locality-aware routing with hierarchical sparse communication to reduce end-to-end latency in distributed SMoE inference.

Replication in Graph Partitioning and Scheduling Problems

cs.DC · 2026-04-30 · unverdicted · novelty 5.0

Replication reduces costs by 17-65% on average in hypergraph partitioning and 11-23% in DAG scheduling, sometimes eliminating communication needs entirely.

Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement

cs.DC · 2025-08-18 · unverdicted · novelty 4.0

Prism optimizes expert placement and uses runtime migration for distributed MoE inference on heterogeneous edge GPUs, achieving up to 30.6% lower latency than baselines.

