CompeteSMoE -- Effective Training of Sparse Mixture of Experts via Competition
2 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: UNVERDICTED (2)
Citing papers
- Convergence Rates for Latent Mixing Measures in Infinite Homoscedastic Location-Scale Mixture Models
The paper provides novel lower bounds connecting L1 distances between mixture densities to discrepancies between their mixing measures, leading to the first contraction rates for Dirichlet process mixtures with unknown scale.
- Tight Clusters Make Specialized Experts
Introduces an Adaptive Clustering router for MoE models that scales features to identify tight expert clusters, yielding faster convergence, robustness to data corruption, and performance gains.