FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.DC 2verdicts
UNVERDICTED 2representative citing papers
DMA offloads on AMD MI300X GPUs are extended to latency-bound ML communication using untapped hardware features, closing up to 4.5x performance gap versus RCCL in collectives and delivering up to 1.5x lower latency and 1.9x higher throughput in LLM inference over vLLM.
citing papers explorer
-
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
-
DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication
DMA offloads on AMD MI300X GPUs are extended to latency-bound ML communication using untapped hardware features, closing up to 4.5x performance gap versus RCCL in collectives and delivering up to 1.5x lower latency and 1.9x higher throughput in LLM inference over vLLM.