Generalized on-policy distillation with reward scaling above one (ExOPD) lets student models surpass teacher performance when merging domain experts on math and code tasks.
Sequence-level knowledge distillation
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2representative citing papers
Self-distillation turns pretrained autoregressive LMs into multi-token predictors that decode over 3x faster with under 5% accuracy drop on GSM8K.
citing papers explorer
-
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Generalized on-policy distillation with reward scaling above one (ExOPD) lets student models surpass teacher performance when merging domain experts on math and code tasks.
-
Multi-Token Prediction via Self-Distillation
Self-distillation turns pretrained autoregressive LMs into multi-token predictors that decode over 3x faster with under 5% accuracy drop on GSM8K.