MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
Our conclusion regarding a consistent optimal activation rate contradicts the findings of Abnar et al
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2025 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.