Yuan 2.0-M32: Mixture of Experts with Attention Router
2 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts
  L2R improves MoE performance by routing in a low-rank space with Lipschitz-controlled saturated inner-product scoring and multi-anchor mechanisms.
- Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource
  MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.