Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
Extends uniform error bounds for Gaussian process vector-valued functions to linear models of co-regionalization and demonstrates performance gains via numerical comparison on a safe multi-task Bayesian optimization benchmark.
citing papers explorer
- How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.
- Safe Bayesian Optimization for Uncertain Correlation Matrices in Linear Models of Co-Regionalization
Extends uniform error bounds for Gaussian process vector-valued functions to linear models of co-regionalization and demonstrates performance gains via numerical comparison on a safe multi-task Bayesian optimization benchmark.