Derives upper and lower generalization bounds for the student relative to the teacher using a new distillation divergence, plus a loss-sharpness-aware bound and a bias-variance-rank decomposition in the linear Gaussian case.
Efficient knowledge distillation from model checkpoints,
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.IT 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
On the Generalization of Knowledge Distillation: An Information-Theoretic View
Derives upper and lower generalization bounds for the student relative to the teacher using a new distillation divergence, plus a loss-sharpness-aware bound and a bias-variance-rank decomposition in the linear Gaussian case.