Scalable and Efficient MoE Training for Multitask Multilingual Models

2021 · arXiv 2109.10465

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

EPnG: Adaptive Expert Prune-and-Grow for Parameter-Efficient MoE Fine-tuning

cs.LG · 2026-07-02 · unverdicted · novelty 6.0

EPnG reallocates LoRA capacity in MoE models by pruning experts with low router gate probabilities and growing the rank of high-importance ones, outperforming standard LoRA and approaching full fine-tuning performance while updating only 0.55-0.72% of parameters.

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

cs.LG · 2023-09-25 · accept · novelty 6.0

DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

On the Utility and Factual Reliability of Pruned Mixture-of-Experts Models in the Biomedical Domain

cs.LG · 2026-07-01 · unverdicted · novelty 4.0

Moderate pruning of MoE models preserves in-domain biomedical utility and reliability but both degrade rapidly in cross-domain settings and at extreme pruning ratios.

citing papers explorer

Showing 3 of 3 citing papers.

EPnG: Adaptive Expert Prune-and-Grow for Parameter-Efficient MoE Fine-tuning

cs.LG · 2026-07-02 · unverdicted · none · ref 13

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

cs.LG · 2023-09-25 · accept · none · ref 171

On the Utility and Factual Reliability of Pruned Mixture-of-Experts Models in the Biomedical Domain

cs.LG · 2026-07-01 · unverdicted · none · ref 10
