Selective Sinkhorn Routing for Improved Sparse Mixture of Experts
Sparse Mixture-of-Experts (SMoE) models are scalable and computationally efficient, enabling large increases in model capacity with limited inference overhead. Existing SMoE methods often depend on auxiliary objectives, such as load-balancing loss and z-loss, or additional trainable components such as noisy gating. While these techniques encourage expert diversity, they can introduce objective misalignment, increase model complexity, or incur substantial training overhead, especially in Sinkhorn-based routing methods. In this paper, we revisit the token-to-expert assignment as an optimal transport problem. We add constraints to ensure balanced expert utilization. We show that even minimal optimal transport-based routing improves SMoE performance without requiring auxiliary balancing losses. Unlike prior approaches, our method derives gating scores directly from the transport map, leading to more balanced and effective token-to-expert assignments. Building on this insight, we introduce Selective Sinkhorn Routing (SSR), a lightweight routing mechanism that replaces complex auxiliary losses with efficient Sinkhorn-based routing while preserving flexible expert selection. Experiments on language modeling and image classification show that SSR improves training efficiency, accuracy, and robustness to input corruption.
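To make the optimal-transport view of routing concrete, here is a minimal sketch of generic entropic Sinkhorn routing in NumPy. It is not the paper's SSR algorithm, which the abstract does not specify in detail. The function name `sinkhorn_route`, the uniform token and expert marginals, the temperature `eps`, the iteration count, and the row-normalized gating rule are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn_route(logits, n_iters=50, eps=1.0, top_k=1):
    """Illustrative Sinkhorn routing (hypothetical helper, not the paper's SSR).

    logits: (n_tokens, n_experts) array of token-to-expert affinities.
    Returns the transport plan, per-token gating scores, and top-k expert ids.
    """
    n_tokens, n_experts = logits.shape
    # Gibbs kernel of the affinities; subtracting the max avoids overflow.
    K = np.exp((logits - logits.max()) / eps)
    # Assumed marginals: every token sends equal mass, every expert receives equal mass.
    r = np.full(n_tokens, 1.0 / n_tokens)
    c = np.full(n_experts, 1.0 / n_experts)
    u = np.ones(n_tokens)
    v = np.ones(n_experts)
    for _ in range(n_iters):
        u = r / (K @ v)      # rescale rows toward the token marginal
        v = c / (K.T @ u)    # rescale columns toward the expert marginal
    plan = u[:, None] * K * v[None, :]
    # Assumed gating rule: read scores off the plan by normalizing each token's row.
    gates = plan / plan.sum(axis=1, keepdims=True)
    experts = np.argsort(-gates, axis=1)[:, :top_k]
    return plan, gates, experts

rng = np.random.default_rng(0)
plan, gates, experts = sinkhorn_route(rng.normal(size=(16, 4)), top_k=2)
```

Because the loop ends with a column rescaling, each expert's column of `plan` sums to `1 / n_experts` up to floating-point error, which is the balance constraint. The row sums match the token marginal only approximately after a finite number of iterations.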
Forward citations
Cited by 1 Pith paper
Visual Text Compression as Measure Transport
Framing visual text compression as measure transport decomposes encoding loss into precision and coverage costs, enabling a label-free routing rule that matches oracle performance on 17 of 24 NLP datasets while using ...