DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training

Amit Hasan; Can Jin; Dimitris N. Metaxas; Hongwu Peng; Mingcan Xiang; Ohi Dibua; Qixin Zhang; Xiangchi Yuan; Yan Kang; Yifan Gong

arxiv: 2512.13996 · v3 · pith:MK3FHH6Bnew · submitted 2025-12-16 · 💻 cs.AI

Can Jin, Hongwu Peng, Mingcan Xiang, Qixin Zhang, Xiangchi Yuan, Amit Hasan, Ohi Dibua, Yifan Gong, Yan Kang, Dimitris N. Metaxas

keywords Top-p, routing, model, DTop-p, dynamic, expert, probability, capacity

Sparse Mixture-of-Experts architectures are essential for scaling model capacity efficiently, yet the standard Top-$k$ routing imposes a rigid sparsity pattern that ignores the intrinsic variance in token difficulty and layer-specific computational needs. Top-$p$ routing is more adaptive because it selects experts until their cumulative routing probability reaches a threshold, allowing confident tokens to use fewer experts and ambiguous tokens to recruit more. However, we demonstrate that existing naive Top-$p$ implementations with fixed global probability thresholds provide only marginal gains over Top-$k$, suffer from hyperparameter sensitivity, and result in uncontrolled computational costs. In this paper, we propose **DTop-$p$**, a sparsity-controllable dynamic routing mechanism that learns the Top-$p$ probability threshold with a Proportional-Integral controller and uses dynamic routing normalization to support layer-wise expert selection under a global sparsity constraint. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that **DTop-$p$** consistently outperforms both Top-$k$ and fixed Top-$p$ baselines while matching the average FLOPs of Top-$k$ MoE. Our analysis confirms that **DTop-$p$** exhibits strong scaling properties across expert granularity, total expert capacity, model size, and dataset size, offering a robust and efficient MoE framework for foundation model pre-training.
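
The routing idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract gives no equations, so the controller form (a positional PI law on the gap between a target and the observed mean number of active experts), the gains `kp` and `ki`, the clamping bounds, and all function and class names below are assumptions made for illustration. The paper's dynamic routing normalization is not modeled here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(probs, p):
    """Pick experts in descending probability until cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen


class PIThreshold:
    """Hypothetical PI controller that steers the Top-p threshold toward a
    target mean number of active experts per token (an assumed formulation)."""

    def __init__(self, p_init=0.5, target_active=2.0, kp=0.02, ki=0.005,
                 p_min=0.05, p_max=0.99):
        self.p_init = p_init
        self.p = p_init
        self.target_active = target_active
        self.kp, self.ki = kp, ki
        self.p_min, self.p_max = p_min, p_max
        self.integral = 0.0

    def update(self, mean_active):
        # Positive error: too few experts are active, so raise the threshold.
        error = self.target_active - mean_active
        self.integral += error
        raw = self.p_init + self.kp * error + self.ki * self.integral
        self.p = min(self.p_max, max(self.p_min, raw))
        return self.p


# One confident token (peaked logits) and one ambiguous token (flat logits).
tokens = [[4.0, 0.5, 0.2, 0.1], [1.0, 0.9, 0.8, 0.7]]
ctrl = PIThreshold(p_init=0.6, target_active=2.0)
active = [len(top_p_select(softmax(t), ctrl.p)) for t in tokens]
new_p = ctrl.update(sum(active) / len(active))
```

With these made-up logits, the peaked token crosses the threshold with a single expert while the flat token needs three, so the batch averages two active experts and the controller leaves the threshold unchanged. If the average drifted below the target, the same update would raise `p`, and a drift above the target would lower it.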

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE
cs.AI 2026-05 conditional novelty 6.0

BEAM uses binary expert activation masks trained end-to-end to achieve dynamic sparsity in MoE models, cutting FLOPs by 85% with over 98% performance retention.