BEAM uses binary expert activation masks trained end-to-end to achieve dynamic sparsity in MoE models, cutting FLOPs by 85% with over 98% performance retention.
MaskMoE: Boosting token-level learning via routing mask in mixture-of-experts. arXiv preprint arXiv:2407.09816
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Representative citing papers
Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts
AIR-MoE introduces a two-stage inverted-index routing method based on vector quantization that approximates optimal expert selection for granular MoE models at lower cost and with empirical performance gains.