Variational Information Distillation for Knowledge Transfer

Andreas Damianou; Neil D. Lawrence; Shell Xu Hu; Sungsoo Ahn; Zhenwen Dai

arxiv: 1904.05835 · v1 · pith:2SW3DGOTnew · submitted 2019-04-11 · cs.CV · cs.AI · cs.LG

Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D. Lawrence, Zhenwen Dai

classification cs.CV cs.AI cs.LG

keywords knowledge, transfer, network, neural, student, existing, method, methods

Transferring knowledge from a teacher neural network pretrained on the same or a similar task to a student neural network can significantly improve the performance of the student neural network. Existing knowledge transfer approaches match the activations or the corresponding hand-crafted features of the teacher and the student networks. We propose an information-theoretic framework for knowledge transfer which formulates knowledge transfer as maximizing the mutual information between the teacher and the student networks. We compare our method with existing knowledge transfer methods on both knowledge distillation and transfer learning tasks and show that our method consistently outperforms existing methods. We further demonstrate the strength of our method on knowledge transfer across heterogeneous network architectures by transferring knowledge from a convolutional neural network (CNN) to a multi-layer perceptron (MLP) on CIFAR-10. The resulting MLP significantly outperforms state-of-the-art methods and achieves performance similar to that of a CNN with a single convolutional layer.
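
The mutual-information objective in the abstract can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' implementation. It relies on the standard variational lower bound I(t; s) >= H(t) + E[log q(t | s)], which holds for any distribution q, so raising the expected log-likelihood of teacher features under q raises the bound. The choice of a diagonal Gaussian q, the function name `gaussian_nll`, and the toy inputs are all hypothetical.

```python
import math


def gaussian_nll(teacher, predicted_mean, log_var):
    """Average negative log-likelihood of teacher features under a
    diagonal Gaussian q(t | s) = N(predicted_mean, exp(log_var)).

    Minimizing this value maximizes E[log q(t | s)], the student-dependent
    term of the variational lower bound on I(t; s).
    The Gaussian form and all argument names are illustrative assumptions.
    """
    total = 0.0
    for t, mu, lv in zip(teacher, predicted_mean, log_var):
        var = math.exp(lv)  # variance is parameterized by its log to stay positive
        total += 0.5 * (math.log(2.0 * math.pi) + lv + (t - mu) ** 2 / var)
    return total / len(teacher)


# Toy usage: a student whose predicted means match the teacher features
# gets a lower loss than one whose predictions are far off.
teacher_features = [1.0, -2.0]
close = gaussian_nll(teacher_features, [1.0, -2.0], [0.0, 0.0])
far = gaussian_nll(teacher_features, [0.0, 0.0], [0.0, 0.0])
```

In a real training loop the predicted mean would come from the student network and the loss would be combined with the task loss; how those pieces are weighted is not specified on this page.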

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On the Generalization of Knowledge Distillation: An Information-Theoretic View
cs.IT 2026-05 unverdicted novelty 7.0

Knowledge distillation generalization bounds are derived via a new distillation divergence measuring teacher-student kernel difference, with tighter bounds from teacher loss flatness.

On the Generalization of Knowledge Distillation: An Information-Theoretic View
cs.IT 2026-05 unverdicted novelty 7.0

Derives upper and lower generalization bounds for the student relative to the teacher using a new distillation divergence, plus a loss-sharpness-aware bound and a bias-variance-rank decomposition in the linear Gaussian case.