pith. sign in

arxiv: 2109.10465 · v1 · pith:NMB7QKC2new · submitted 2021-09-22 · 💻 cs.CL · cs.AI· cs.LG

Scalable and Efficient MoE Training for Multitask Multilingual Models

classification 💻 cs.CL cs.AIcs.LG
keywords modelssystemtrainingefficiencyefficientmodelmultilingualparameters
0
0 comments X
read the original abstract

The Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models that have sublinear compute costs with respect to their parameters. In contrast with dense models, the sparse architecture of MoE offers opportunities for drastically growing model size with significant accuracy gain while consuming much lower compute budget. However, supporting large scale MoE training also has its own set of system and modeling challenges. To overcome the challenges and embrace the opportunities of MoE, we first develop a system capable of scaling MoE models efficiently to trillions of parameters. It combines multi-dimensional parallelism and heterogeneous memory technologies harmoniously with MoE to empower 8x larger models on the same hardware compared with existing work. Besides boosting system efficiency, we also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve inference time efficiency. By combining the efficient system and training methods, we are able to significantly scale up large multitask multilingual models for language generation which results in a great improvement in model accuracy. A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks. The system support of efficient MoE training has been implemented and open-sourced with the DeepSpeed library.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EPnG: Adaptive Expert Prune-and-Grow for Parameter-Efficient MoE Fine-tuning

    cs.LG 2026-07 unverdicted novelty 6.0

    EPnG reallocates LoRA capacity in MoE models by pruning experts with low router gate probabilities and expanding high-importance ones via rank growth, outperforming standard LoRA and nearing full fine-tuning performan...

  2. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    cs.LG 2023-09 accept novelty 6.0

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  3. On the Utility and Factual Reliability of Pruned Mixture-of-Experts Models in the Biomedical Domain

    cs.LG 2026-07 unverdicted novelty 4.0

    Moderate pruning of MoE models preserves in-domain biomedical utility and reliability but both degrade rapidly in cross-domain settings and at extreme pruning ratios.

  4. The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model

    cs.LG 2026-06 unverdicted novelty 4.0

    A scaling law model derived from roofline analysis and a speedup-based efficiency factor predicts training energy for BERT models across GPU parallelism configurations.