Kernel interpolation generalizes poorly
2 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 2
representative citing papers
A random-projection differentially private kernel ERM method attains minimax-optimal excess risk bounds for squared and Lipschitz-smooth convex losses under local strong convexity, plus the first dimension-free bounds for objective-perturbation private linear ERM.
citing papers explorer
- How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.
- Optimal differentially private kernel learning with random projection
A random-projection differentially private kernel ERM method attains minimax-optimal excess risk bounds for squared and Lipschitz-smooth convex losses under local strong convexity, plus the first dimension-free bounds for objective-perturbation private linear ERM.