MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
Case-based or rule-based: How do transformers do the math? In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27,
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2025 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.