Graph-based Knowledge Distillation by Multi-head Attention Network

Byung Cheol Song; Seunghyun Lee

arXiv: 1907.02226 · v2 · pith:M3LJ3JVQ · submitted 2019-07-04 · cs.LG · stat.ML

Pith reviewed 2026-05-25 09:31 UTC · model grok-4.3

keywords knowledge distillation · multi-head attention · graph-based method · student network · teacher network · relational inductive bias · CIFAR100 · convolutional neural networks

The pith

Multi-head attention builds a graph of dataset relations to distill knowledge from teacher to student networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a way to transfer not only individual data points but the relational structure across an entire dataset from a large teacher network to a small student network. It does this by using multi-head attention to create a graph that captures how the teacher relates different inputs, then training the student with this graph information alongside its usual task. This addresses the limitation of conventional knowledge distillation, which focuses only on point-wise knowledge and misses dataset-level patterns. A sympathetic reader would care because it could let compact models achieve better results on classification tasks like those in CIFAR-100 by inheriting the teacher's understanding of data relationships.

Core claim

The proposed method distills dataset-based knowledge from the teacher network to a graph using multi-head attention, enabling multi-task learning that provides relational inductive bias to the student network and improves its performance.

What carries the argument

Multi-head attention network that distills the embedding procedure of the teacher into a graph representing intra-data relations.
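
As a rough illustration of how attention can turn a batch of teacher embeddings into a graph, the sketch below computes one row-stochastic attention matrix per head using plain scaled dot-product attention. It is not the paper's implementation: the function name `attention_graphs`, the choice of which embeddings act as queries and which as keys, and the random projection weights (which would be learned in practice) are all assumptions made for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_graphs(front, back, num_heads, head_dim, seed=0):
    """Return one N x N row-stochastic attention matrix per head.

    front, back: N feature vectors each, standing in for teacher embeddings
    taken before and after a block (hypothetical inputs).
    The projection weights are random placeholders, not learned parameters.
    """
    rng = random.Random(seed)
    d_front, d_back = len(front[0]), len(back[0])
    scale = math.sqrt(head_dim)
    graphs = []
    for _ in range(num_heads):
        w_q = [[rng.gauss(0, 1) for _ in range(head_dim)] for _ in range(d_back)]
        w_k = [[rng.gauss(0, 1) for _ in range(head_dim)] for _ in range(d_front)]
        q = matmul(back, w_q)                  # N x head_dim queries
        k = matmul(front, w_k)                 # N x head_dim keys
        k_t = [list(col) for col in zip(*k)]   # head_dim x N
        scores = matmul(q, k_t)                # N x N pairwise scores
        # Each row is a distribution over the other samples in the batch.
        graphs.append([softmax([s / scale for s in row]) for row in scores])
    return graphs
```

Each returned matrix can be read as a weighted adjacency matrix over the samples in the batch, with one graph per head.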

If this is right

On CIFAR100, the student network scores 7.05% higher with the method than when it is trained alone.
The method outperforms the state-of-the-art by 2.46% on the same dataset.
The attention-based graph supplies clear information about the source dataset to the student.
Multi-task learning with the distilled graph imparts useful inductive bias.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This graph approach might apply to other modalities where capturing relations between samples improves transfer.
It raises the question of whether similar attention mechanisms could distill knowledge in unsupervised settings.
Extensions could test if the graph structure generalizes across different teacher architectures.

Load-bearing premise

The multi-head attention applied to the teacher's embeddings extracts relational information that acts as transferable inductive bias for the student network beyond standard distillation techniques.

What would settle it

If experiments on CIFAR100 or similar datasets show that adding the multi-head attention graph does not increase student accuracy compared to conventional knowledge distillation methods using the same teacher, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 1907.02226 by Byung Cheol Song, Seunghyun Lee.

**Figure 2.** Attention heads and an estimator for learning MHAN. Here ...

**Figure 3.** The training curves corresponding to Tables 1 and 2. (a) VGG-CIFAR100 (b) ...

**Figure 4.** The block diagram for network architectures used in the proposed scheme.

Original abstract

Knowledge distillation (KD) is a technique to derive optimal performance from a small student network (SN) by distilling the knowledge of a large teacher network (TN) and transferring the distilled knowledge to the SN. Since a role of convolutional neural network (CNN) in KD is to embed a dataset so as to perform a given task well, it is very important to acquire knowledge that considers intra-data relations. Conventional KD methods have concentrated on distilling knowledge in data units. To our knowledge, no KD method for distilling information in dataset units has yet been proposed. Therefore, this paper proposes a novel method that enables distillation of dataset-based knowledge from the TN using an attention network. The knowledge of the embedding procedure of the TN is distilled into a graph by multi-head attention (MHA), and multi-task learning is performed to give relational inductive bias to the SN. The MHA can provide clear information about the source dataset, which can greatly improve the performance of the SN. Experimental results show that the proposed method is 7.05% higher than the SN alone for CIFAR100, which is 2.46% higher than the state-of-the-art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's new angle is distilling dataset-level relations via an MHA-built graph fed into multi-task learning on the student, but the reported gains lack the ablation needed to tie them to that graph rather than the extra task.

The central move here is to treat the teacher's embedding process as something that produces relational structure across the whole dataset, then use multi-head attention to turn that into a graph which gets passed to the student as an auxiliary task. That dataset-unit framing is presented as new, and it directly addresses the point that standard KD only transfers per-sample signals while ignoring how the data points relate to each other inside the teacher's representation space. The reported lift on CIFAR100 (7.05% over the plain student, 2.46% over prior SOTA) is the concrete evidence they offer for the idea working in practice. That is the part worth noting if you work in this corner of model compression.

The construction itself is straightforward: MHA on teacher features yields the graph, then the student is trained with both its usual loss and a term that tries to match the graph structure. Nothing in the abstract suggests the math is especially heavy or that they needed new theory to make it run.

The soft spot is exactly the one the stress-test flags. There is no described control that keeps the multi-task wrapper fixed while replacing the MHA graph with a random or constant structure. Without that, the numerical improvement could be coming from the simple fact of adding a second head rather than from any transferable inductive bias encoded in the graph. The abstract also gives no protocol details, no variance numbers, and no list of the baselines that were re-run, so the 2.46% edge is hard to evaluate on its own.

This is a niche paper aimed at people already tuning KD pipelines for deployment. A reader who wants another graph-based trick to try on CIFAR-scale problems could extract the MHA-to-graph step and test it themselves, but the current write-up does not supply enough controls to treat the claimed mechanism as demonstrated. I would not send it to review in its present form; the missing ablation is load-bearing for the main claim.
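
The two-term objective described in the letter (the student's usual loss plus a term that matches the graph) can be sketched as follows. This is an illustrative guess, not the paper's loss: the use of cross-entropy for the task term, a row-wise KL divergence for the graph term, the `weight` parameter, and all function names are assumptions.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(max(probs[label], 1e-12))


def kl_rows(teacher_graph, student_graph, eps=1e-12):
    """Mean KL(teacher row || student row) over the rows of two graphs."""
    total = 0.0
    for t_row, s_row in zip(teacher_graph, student_graph):
        total += sum(
            t * math.log((t + eps) / (s + eps))
            for t, s in zip(t_row, s_row)
            if t > 0
        )
    return total / len(teacher_graph)


def multitask_loss(student_probs, labels, teacher_graph, student_graph, weight=1.0):
    """Return (total, task_term, graph_term) for one batch.

    weight is a hypothetical balancing coefficient between the two tasks.
    """
    task = sum(cross_entropy(p, y) for p, y in zip(student_probs, labels)) / len(labels)
    graph = kl_rows(teacher_graph, student_graph)
    return task + weight * graph, task, graph
```

Returning the two terms separately makes it easy to log how much of the total comes from the auxiliary graph task, which is the quantity the missing ablation would need to examine.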

Referee Report

2 major / 0 minor

Summary. The paper proposes a graph-based knowledge distillation method in which multi-head attention (MHA) is applied to a teacher network's embeddings to construct a relational graph that encodes dataset-level structure; this graph is then used within a multi-task learning framework to transfer relational inductive bias to a smaller student network. The abstract reports that the approach yields a 7.05% accuracy improvement over the student alone and a 2.46% improvement over prior state-of-the-art on CIFAR-100.

Significance. If the performance gains can be shown to arise specifically from the MHA-derived relational graph rather than from the multi-task learning setup itself, the work would introduce a new mechanism for transferring dataset-level inductive biases in knowledge distillation, addressing a limitation of conventional point-wise KD methods. The absence of experimental protocol, baseline details, and isolating ablations in the provided text prevents assessment of whether this contribution is realized.

major comments (2)

[Abstract] The central claim that the MHA-constructed graph supplies transferable relational inductive bias beyond point-wise KD is not isolated from the multi-task learning procedure; no ablation is described that holds the multi-task structure fixed while replacing the MHA graph with a non-informative (random or constant) structure, so the reported 7.05% and 2.46% gains cannot be attributed to the claimed mechanism.
[Abstract] The numerical results are presented without any description of the experimental protocol, choice of teacher/student architectures, training hyperparameters, number of runs, or statistical tests, rendering the performance claims unverifiable and preventing evaluation of whether the method outperforms conventional KD baselines.
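
The control requested in the first comment needs graphs that have the same shape as the attention graph but carry no information about the teacher. A minimal sketch of two such stand-ins follows, assuming the graph is an N x N row-stochastic matrix; that representation and the helper names `constant_graph` and `random_graph` are hypothetical.

```python
import random


def constant_graph(n):
    """Uniform graph: every sample is connected equally to every sample."""
    return [[1.0 / n] * n for _ in range(n)]


def random_graph(n, seed=0):
    """Random row-stochastic graph that carries no teacher information."""
    rng = random.Random(seed)
    graph = []
    for _ in range(n):
        raw = [rng.random() + 1e-9 for _ in range(n)]  # strictly positive weights
        total = sum(raw)
        graph.append([v / total for v in raw])
    return graph


def control_targets(n, seed=0):
    """Graph targets for the non-informative arms of the ablation."""
    return {"constant": constant_graph(n), "random": random_graph(n, seed)}
```

Training the student against each of these targets with the multi-task setup otherwise unchanged, and comparing with the attention-derived target, would show whether the gain depends on the graph's content.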

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments, which identify key areas where the manuscript can be strengthened for clarity and rigor. We address each major comment below and will revise the paper to incorporate the necessary changes.

Referee: [Abstract] The central claim that the MHA-constructed graph supplies transferable relational inductive bias beyond point-wise KD is not isolated from the multi-task learning procedure; no ablation is described that holds the multi-task structure fixed while replacing the MHA graph with a non-informative (random or constant) structure, so the reported 7.05% and 2.46% gains cannot be attributed to the claimed mechanism.

Authors: We agree that the manuscript does not describe an ablation that holds the multi-task learning framework fixed while substituting a non-informative (random or constant) graph for the MHA-derived structure. Such an experiment is required to isolate whether the reported gains derive specifically from the relational inductive bias encoded by the MHA graph. We will add this control experiment to the revised manuscript.
revision: yes

Referee: [Abstract] The numerical results are presented without any description of the experimental protocol, choice of teacher/student architectures, training hyperparameters, number of runs, or statistical tests, rendering the performance claims unverifiable and preventing evaluation of whether the method outperforms conventional KD baselines.

Authors: We acknowledge that the abstract (and the provided text) contains no description of the experimental protocol, architectures, hyperparameters, number of runs, or statistical tests. We will revise the manuscript to include a clear experimental protocol section detailing the teacher and student architectures, training hyperparameters, evaluation on CIFAR-100, number of runs, and any statistical measures, along with explicit comparisons to conventional point-wise KD baselines.
revision: yes
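
The run-level reporting promised in the second response could take a form like the sketch below, which summarizes accuracies over repeated runs and the per-seed gain over a baseline. The helper names and the choice of sample standard deviation and standard error are assumptions; the paper does not say which statistics it would report.

```python
import math
import statistics


def summarize_runs(values):
    """Return (mean, sample std, standard error) over independent runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return mean, std, std / math.sqrt(len(values))


def paired_gain(method_runs, baseline_runs):
    """Summarize per-seed differences (method minus baseline)."""
    if len(method_runs) != len(baseline_runs):
        raise ValueError("runs must be paired seed by seed")
    diffs = [m - b for m, b in zip(method_runs, baseline_runs)]
    return summarize_runs(diffs)
```

Pairing runs by seed keeps initialization noise from inflating the spread of the reported gain.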

Circularity Check

0 steps flagged

No derivation chain present; empirical method only

The paper describes an empirical KD technique that constructs a graph via multi-head attention on teacher embeddings and injects it via multi-task learning on the student. No equations, first-principles derivations, or parameter-fitting steps are shown that could reduce to their own inputs by construction. Performance numbers (e.g., +7.05% on CIFAR-100) are reported experimental outcomes, not predictions obtained by fitting to the same data. No self-citation is used to justify a uniqueness theorem or ansatz. The method is therefore self-contained against external benchmarks and exhibits no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that CNN embeddings encode useful intra-dataset relations that can be captured by attention and transferred as inductive bias.

axioms (1)

domain assumption: A role of convolutional neural network (CNN) in KD is to embed a dataset so as to perform a given task well.
Stated directly in the abstract as the premise for needing dataset-level knowledge.

