PADD distills from dense teachers to MoE students via neuron clustering, expert warmup, online adaptive distillation, path-refined policy optimization, and reward-augmented load balancing, yielding gains on math reasoning benchmarks.
arXiv preprint arXiv:2510.23027, year=
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
- PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning