Multi-Modality Distillation via Learning the teacher's modality-level Gram Matrix

Peng Liu

arxiv: 2112.11447 · v2 · submitted 2021-12-21 · 💻 cs.AI · cs.CV

Pith reviewed 2026-05-24 12:01 UTC · model grok-4.3

keywords multi-modality · knowledge distillation · Gram matrix · modality relations · teacher-student · knowledge transfer · neural networks

The pith

Learning the teacher's modality-level Gram Matrix lets a student network capture inter-modality relationships that standard output distillation misses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing multi-modality knowledge distillation methods focus only on the teacher's final output, which leaves significant differences in how teacher and student networks process relationships between modalities. The paper proposes a new paradigm where the student learns the teacher's modality-level Gram Matrix to model those relationships explicitly. This is intended to transfer more complete knowledge from teacher to student. A reader would care if this closes the gap more effectively than output-only approaches in tasks involving multiple data types.

Core claim

By adopting a modality relation distillation paradigm that learns the teacher modality-level Gram Matrix, the student network acquires the relationship information among different modalities from the teacher, which addresses the deep differences that remain when only the final output is distilled.

What carries the argument

The modality-level Gram Matrix, which encodes relationship information among different modalities.
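
One way to make this concrete, offered as a sketch rather than the paper's stated definition: the passage quoted from the paper further down this page gives G = A·A^T. Assuming (the text available here does not confirm it) that row i of A is a single feature vector a_i summarising modality i, with M modalities and feature width d, the Gram matrix is the M×M table of pairwise inner products. The symbols a_i, M and d, and the two-modality case, are illustrative assumptions.

```latex
A = \begin{bmatrix} a_1^\top \\ \vdots \\ a_M^\top \end{bmatrix} \in \mathbb{R}^{M \times d},
\qquad
G = A A^\top \in \mathbb{R}^{M \times M},
\qquad
G_{ij} = a_i^\top a_j .

% Two modalities (M = 2):
G = \begin{bmatrix}
      \lVert a_1 \rVert^2 & a_1^\top a_2 \\
      a_1^\top a_2        & \lVert a_2 \rVert^2
    \end{bmatrix}
```

Under this reading the off-diagonal entries are the cross-modality terms, and the size of G depends only on the number of modalities M, not on the feature width d.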

Load-bearing premise

Forcing the student to match the teacher's modality-level Gram Matrix will transfer additional knowledge about modality relationships beyond what matching final outputs achieves.

What would settle it

A controlled experiment on a multi-modal dataset where the student trained with Gram Matrix matching shows no gain in task performance or internal similarity metrics over a student trained only on the teacher's final outputs.

Figures

Figures reproduced from arXiv: 2112.11447 by Peng Liu.

**Figure 2.** The modality relationship result of our method.

Original abstract

In the context of multi-modality knowledge distillation research, the existing methods was mainly focus on the problem of only learning teacher final output. Thus, there are still deep differences between the teacher network and the student network. It is necessary to force the student network to learn the modality relationship information of the teacher network. To effectively exploit transfering knowledge from teachers to students, a novel modality relation distillation paradigm by modeling the relationship information among different modality are adopted, that is learning the teacher modality-level Gram Matrix.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Gram matrix for modality relations in multi-modal distillation is proposed but unvalidated in the abstract.

The key takeaway is that this paper introduces a modality relation distillation method using the teacher's modality-level Gram matrix to capture relationships between modalities, aiming to go beyond standard output-based distillation in multi-modal settings. It does a good job highlighting the potential limitation in existing approaches that only match final outputs, leaving room for deeper differences between teacher and student. Framing the Gram matrix as a way to model those relations is a clear idea.

The main issue is that the text provided is just the abstract. There are no equations showing how the modality-level Gram matrix is defined or computed, no description of the loss, and no experiments or comparisons to baselines. Without that, it's impossible to know if this actually improves performance or closes the claimed gaps. Prior work on Gram matrices in style transfer and feature correlations means the novelty needs to be shown through specific adaptation and results, which aren't here.

This paper would be of interest to researchers focused on knowledge distillation for multi-modal AI models. Someone looking for new ways to transfer relational knowledge might find the concept worth exploring if the full paper includes solid validation.

Based on what's available, it doesn't look ready for peer review. The central claim is unverified. I'd recommend not sending it to referees until there's empirical evidence that the method works.

Referee Report

2 major / 1 minor

Summary. The paper claims that existing multi-modality knowledge distillation methods focus only on the teacher's final output, leaving deep differences between teacher and student networks unaddressed. It proposes a novel modality relation distillation paradigm that models inter-modality relationship information by forcing the student to learn the teacher's modality-level Gram Matrix.

Significance. If the central claim holds and the Gram Matrix matching demonstrably narrows the claimed deep differences beyond standard output distillation, the approach could improve relational knowledge transfer in multi-modal settings. However, the manuscript supplies no derivation, implementation, or validation, so significance cannot be assessed from the given text.

major comments (2)

[Abstract] The claim that learning the modality-level Gram Matrix closes 'deep differences' beyond output-only distillation is presented without any definition of the Gram Matrix, loss formulation, derivation, or experimental evidence. This is load-bearing for the central claim.

[Abstract] No details are given on how the modality-level Gram Matrix is computed from the teacher's features, how it is used in training, or why it captures modality relationships that standard distillation misses.

minor comments (1)

[Abstract] The abstract contains grammatical issues: 'the existing methods was mainly focus' should read 'existing methods mainly focus'; 'transfering' should be 'transferring'; 'are adopted' should be 'is adopted'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We address the major comments below. The abstract will be revised to include more details on the proposed method.

Referee: [Abstract] The claim that learning the modality-level Gram Matrix closes 'deep differences' beyond output-only distillation is presented without any definition of the Gram Matrix, loss formulation, derivation, or experimental evidence. This is load-bearing for the central claim.

Authors: We acknowledge that the abstract does not define the Gram Matrix or provide the loss or evidence. The manuscript will be updated to include these in the abstract and to provide the derivation based on feature correlations and experimental results showing the benefit over standard distillation. Revision: yes.

Referee: [Abstract] No details are given on how the modality-level Gram Matrix is computed from the teacher's features, how it is used in training, or why it captures modality relationships that standard distillation misses.

Authors: The current abstract is brief and omits these implementation details. We will revise it to explain that the Gram Matrix is computed as the inner product matrix of modality features from the teacher, used as an additional loss term during training, and captures relationships by encoding cross-modality correlations not addressed by output matching alone. Revision: yes.
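
The computation described in this simulated response can be sketched as follows. It is a minimal illustration under assumptions the abstract does not confirm: each modality is summarised by one feature vector, the teacher and student Gram matrices are compared with a mean-squared error, and the result is added to the task loss with a weight `beta`. The function names (`gram`, `modality_relation_loss`, `total_loss`), the weight `beta`, and the toy feature values are all hypothetical. The passage quoted from the paper writes the loss as Lmr = MSE(At, As); taking the MSE between the two Gram matrices is an assumption about what those symbols denote.

```python
from typing import List

Matrix = List[List[float]]


def gram(features: Matrix) -> Matrix:
    """Return G = A A^T, where each row of A is one modality's feature vector."""
    return [[sum(x * y for x, y in zip(a, b)) for b in features] for a in features]


def mse(p: Matrix, q: Matrix) -> float:
    """Mean squared difference over all entries of two equally shaped matrices."""
    diffs = [(x - y) ** 2 for row_p, row_q in zip(p, q) for x, y in zip(row_p, row_q)]
    return sum(diffs) / len(diffs)


def modality_relation_loss(teacher_feats: Matrix, student_feats: Matrix) -> float:
    """Assumed form of the relation loss: MSE between teacher and student Gram matrices."""
    return mse(gram(teacher_feats), gram(student_feats))


def total_loss(task_loss: float, relation_loss: float, beta: float = 1.0) -> float:
    """Hypothetical combined objective: task loss plus a weighted relation term."""
    return task_loss + beta * relation_loss


# Toy example with two modalities. The teacher uses 4-dimensional features and
# the student 2-dimensional ones; both Gram matrices are still 2 x 2.
teacher = [[1.0, 0.0, 1.0, 0.0],
           [0.0, 1.0, 1.0, 0.0]]
student = [[1.0, 1.0],
           [1.0, 0.0]]

print(gram(teacher))                              # [[2.0, 1.0], [1.0, 2.0]]
print(gram(student))                              # [[2.0, 1.0], [1.0, 1.0]]
print(modality_relation_loss(teacher, student))   # 0.25
```

Because the Gram matrix has one row and column per modality, the teacher and student can use different feature widths without a projection layer. Whether the paper normalises the features first, or applies the loss at several layers, cannot be determined from the abstract.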

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes a new modality relation distillation paradigm that forces the student to match the teacher's modality-level Gram Matrix, motivated by the claim that output-only distillation leaves deep differences unaddressed. No equations, parameter-fitting steps, or derivation chains are exhibited in the provided text. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing justifications. The central claim is a straightforward methodological suggestion (adopt Gram Matrix matching as auxiliary loss) that does not reduce to its own inputs by construction and remains open to external empirical validation. This is the most common honest finding for a method-proposal abstract without internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are detailed in the text. The approach implicitly assumes that the Gram Matrix captures transferable modality relations, without stating supporting evidence or assumptions.

pith-pipeline@v0.9.0 · 5596 in / 866 out tokens · 17213 ms · 2026-05-24T12:01:28.804195+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "we propose to model such modality relation to transfer knowledge from teacher to student... G = A·A^T ... Lmr = MSE(At, As)"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "novel modality relation distillation paradigm by modeling the relationship information among different modality"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

