Megablocks: Efficient sparse training with mixture-of-experts

Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia · 2022 · arXiv 2211.15841

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · novelty 7.0

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

cs.LG · 2026-04-28 · unverdicted · novelty 6.0

RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.

Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

Profiling shows persistent expert load imbalance and domain-specific activation patterns in large MoE models; workload-aware grouping and placement reduce all-to-all communication volume by up to 20x.

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

cs.CL · 2024-11-07 · conditional · novelty 6.0

MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.

TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

cs.DC · 2026-04-27 · unverdicted · novelty 5.0

TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.

Mixtral of Experts

cs.LG · 2024-01-08 · unverdicted · novelty 5.0

Mixtral 8x7B is a sparse MoE LLM activating 2 of 8 experts per layer that matches or exceeds Llama 2 70B and GPT-3.5 on benchmarks while using only 13B active parameters.

citing papers explorer

Showing 6 of 6 citing papers.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference cs.DC · 2026-05-11 · unverdicted · none · ref 7
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.
RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts cs.LG · 2026-04-28 · unverdicted · none · ref 10
RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.
Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns cs.LG · 2026-04-25 · unverdicted · none · ref 7
Profiling shows persistent expert load imbalance and domain-specific activation patterns in large MoE models; workload-aware grouping and placement reduce all-to-all communication volume by up to 20x.
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models cs.CL · 2024-11-07 · conditional · none · ref 12
MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.
TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training cs.DC · 2026-04-27 · unverdicted · none · ref 19
TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.
Mixtral of Experts cs.LG · 2024-01-08 · unverdicted · none · ref 13
Mixtral 8x7B is a sparse MoE LLM activating 2 of 8 experts per layer that matches or exceeds Llama 2 70B and GPT-3.5 on benchmarks while using only 13B active parameters.

Megablocks: Efficient sparse training with mixture-of-experts

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer