Switchless topologies such as 3D full-mesh are 20.6-56.2% more cost-effective than scale-up networks for MoE LLM serving, with current link bandwidths over-provisioned by up to 27%.
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4
citing papers explorer
- Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving
  Switchless topologies such as 3D full-mesh are 20.6-56.2% more cost-effective than scale-up networks for MoE LLM serving, with current link bandwidths over-provisioned by up to 27%.
- Janus: Disaggregating Attention and Experts for Scalable MoE Inference
  JANUS disaggregates attention and MoE layers onto separate GPU pools with an expert-balancing scheduler and SLO-aware scaling, delivering up to 4.7x higher per-GPU throughput than prior MoE systems under token-level latency constraints.
- From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill
  Layered prefill replaces token-chunked prefill with layer-group interleaving in MoE models, cutting TTFT by up to 70%, end-to-end latency by 41%, and per-token energy by 22% while preserving stall-free TBT.
- Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO
  StableHLO serves as a viable unified representation for cross-architecture performance modeling of distributed ML workloads, preserving relative trends while exposing fidelity trade-offs.