Yuan 2.0-m32: Mixture of experts with attention router

Shaohua Wu, Jiangang Luo, Xi Chen, Lingjun Li, Xudong Zhao, Tong Yu, Chao Wang, Yue Wang, Fei Wang, Weixu Qiao, et al. · arXiv 2405.17976

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts

cs.LG · 2026-01-29 · unverdicted · novelty 6.0

L2R improves MoE performance by routing in a low-rank space with Lipschitz-controlled saturated inner-product scoring and multi-anchor mechanisms.

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

cs.CL · 2025-06-13 · conditional · novelty 6.0

MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.

citing papers explorer

Showing 2 of 2 citing papers.

L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts

cs.LG · 2026-01-29 · unverdicted · none · ref 10

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

cs.CL · 2025-06-13 · conditional · none · ref 41
