Derives upper and lower generalization bounds for the student relative to the teacher using a new distillation divergence, plus a loss-sharpness-aware bound and a bias-variance-rank decomposition in the linear Gaussian case.
Condition on $(X, D_n)$ and use convexity of KL. Given $X$ and $D_n$, $\hat{Y} \mid (X, D_n)$ is a mixture over $W_T \mid D_n$: $\hat{Y} \mid (X, D_n) \sim \int q_T(W_T \mid D_n)\, \mathcal{MN}(W_T X, I_k, \nu^2 I_n)\, dW_T$
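The convexity step can be checked numerically on a toy case. The sketch below is not the paper's construction: it assumes a finite mixture of discrete distributions in place of the matrix-normal components, and the names `weights`, `components`, and `reference` are hypothetical stand-ins for $q_T$, the per-$W_T$ conditionals, and a comparison distribution. It only illustrates that the KL divergence of a mixture is at most the mixture-weighted average of the component KL divergences.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for discrete distributions on a shared support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mixture(weights, components):
    """Weighted average of discrete distributions (the mixture distribution)."""
    size = len(components[0])
    return [sum(w * c[i] for w, c in zip(weights, components)) for i in range(size)]


# Hypothetical toy values: three components standing in for different W_T draws.
weights = [0.2, 0.5, 0.3]
components = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
reference = [0.25, 0.25, 0.5]  # assumed comparison distribution

mixed = mixture(weights, components)
kl_of_mixture = kl(mixed, reference)
average_kl = sum(w * kl(c, reference) for w, c in zip(weights, components))

# Convexity of KL in its first argument: KL(mixture) <= average of KLs.
print(f"KL of mixture: {kl_of_mixture:.6f}")
print(f"Average KL:    {average_kl:.6f}")
```

The same inequality is what lets a bound on each component (fixed $W_T$) be averaged into a bound on the mixture.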
Fields: cs.IT (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
On the Generalization of Knowledge Distillation: An Information-Theoretic View