Pre-training distillation for large language models: A design space exploration
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
Citation-role summary: background 1
Years: 2026 (2)
Roles: background 1
Polarities: background 1
Representative citing papers:
Pruning pretrained MoE models outperforms training from scratch under a fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.
Citing papers explorer
- Evolving Knowledge Distillation for Lightweight Neural Machine Translation
  EKD trains lightweight NMT students progressively from a chain of teachers with rising capacity, achieving BLEU scores within 0.08 of the largest teacher on IWSLT-14.
- SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
  Pruning pretrained MoE models outperforms training from scratch under a fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.