Self-supervised knowledge distillation using singular value decomposition
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.LG (1)
Years: 2019 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
- Graph-based Knowledge Distillation by Multi-head Attention Network
Multi-head attention constructs a graph of dataset relations from the teacher embedding procedure and transfers it to the student via multi-task learning, yielding 7.05% higher CIFAR-100 accuracy than the student alone and 2.46% above prior SOTA.