BicKD: Bilateral Contrastive Knowledge Distillation

Hong kyu Lee; Jiangnan Zhu; Junxu Liu; Li Xiong; Yixuan Liu; Yujie Gu; Yukai Xu

arxiv: 2602.01265 · v2 · submitted 2026-02-01 · 💻 cs.LG


Pith reviewed 2026-05-16 08:46 UTC · model grok-4.3

classification 💻 cs.LG

keywords knowledge distillation · contrastive loss · bilateral comparison · probability alignment · class orthogonality · model compression · teacher-student learning


The pith

BicKD adds bilateral contrastive loss to align sample-wise and class-wise predictions while enforcing orthogonality in probability spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vanilla knowledge distillation aligns teacher and student outputs only sample by sample, without class-level comparisons or rules on how the full probability distribution is shaped. BicKD introduces a bilateral contrastive loss that compares both individual samples and entire classes, pushing different class generalization spaces to become orthogonal while keeping predictions consistent inside each class. This adds geometric regularization to the student’s predictive distribution. Experiments across model architectures and standard benchmarks show the method improves knowledge transfer and outperforms prior distillation techniques.
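As a point of reference for the sample-by-sample alignment described above, the following is a minimal sketch of the vanilla (Hinton-style) KD term in plain Python. It assumes raw logits stored as nested lists; the temperature of 4.0 and the toy logits are illustrative choices, not values taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one sample's logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def vanilla_kd_loss(teacher_logits, student_logits, temperature=4.0):
    """Mean over samples of T^2 * KL(teacher || student) on softened outputs."""
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row, temperature)  # softened teacher distribution
        q = softmax(s_row, temperature)  # softened student distribution
        kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
        losses.append(temperature ** 2 * kl)
    return sum(losses) / len(losses)

# Toy logits for two samples and three classes (illustrative only).
teacher = [[5.0, 1.0, 0.0], [0.5, 4.0, 1.0]]
student = [[3.0, 2.0, 0.5], [1.0, 2.5, 1.5]]

print(vanilla_kd_loss(teacher, student))
print(vanilla_kd_loss(teacher, teacher))  # identical outputs give zero loss
```

Each sample contributes its own KL term and no term compares one class against another, which is the gap the bilateral loss is meant to fill.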

Core claim

The paper establishes that inserting a bilateral contrastive loss into knowledge distillation intensifies orthogonality among different class generalization spaces while preserving consistency within the same class, thereby enabling explicit sample-wise and class-wise comparison of teacher and student predictions and regularizing the geometric structure of the probability space.
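One way to make "orthogonality among class generalization spaces" concrete is sketched below. The notation is an assumption made for this page (the paper's own symbols are not reproduced here): rows of a batch probability matrix are samples and columns are classes, which matches the "sample/row" wording in the Figure 2 caption.

```latex
% Assumed notation: B samples, C classes; P^T and P^S are the teacher and
% student probability matrices for one batch.
P^{T},\, P^{S} \in \mathbb{R}^{B \times C}, \qquad
p_{i,:} \ \text{(row: prediction for sample } i\text{)}, \qquad
p_{:,c} \ \text{(column: class } c \text{ across the batch)}

% Inter-class orthogonality, read as near-zero cosine between class columns:
\cos\!\left(p_{:,c},\, p_{:,c'}\right)
  = \frac{\langle p_{:,c},\, p_{:,c'} \rangle}
         {\lVert p_{:,c} \rVert_2 \, \lVert p_{:,c'} \rVert_2}
  \approx 0 \quad \text{for } c \neq c'
```

Because probabilities are non-negative, this cosine lies in [0, 1] and equals zero only when no sample in the batch places mass on both classes, so pushing it down discourages predictions that spread mass across several classes.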

What carries the argument

Bilateral contrastive loss that performs dual-directional comparisons to enforce inter-class orthogonality and intra-class consistency in the output probability space.
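The dual-directional comparison can be illustrated with a small sketch. This is not the paper's Eq. (4)–(6), which are not reproduced on this page; it swaps in a generic InfoNCE-style objective over cosine similarities, applied once to the rows (samples) and once to the columns (classes) of the batch probability matrices. The function names, the temperature `tau=0.5`, and the toy matrices are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def info_nce(teacher_vecs, student_vecs, tau):
    """Student vector i should match teacher vector i (positive pair) and
    differ from teacher vectors at every other index (negative pairs)."""
    total = 0.0
    for i, s in enumerate(student_vecs):
        sims = [cosine(s, t) / tau for t in teacher_vecs]
        m = max(sims)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in sims))
        total += log_denominator - sims[i]
    return total / len(student_vecs)

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def bilateral_contrastive(teacher_probs, student_probs, tau=0.5):
    """Hypothetical bilateral loss: sample-wise (rows) plus class-wise (columns)."""
    sample_wise = info_nce(teacher_probs, student_probs, tau)
    class_wise = info_nce(transpose(teacher_probs), transpose(student_probs), tau)
    return sample_wise + class_wise

# Toy batch: 4 samples, 3 classes (illustrative only).
teacher_probs = [[0.90, 0.05, 0.05],
                 [0.05, 0.90, 0.05],
                 [0.05, 0.05, 0.90],
                 [0.80, 0.10, 0.10]]
# A student whose class columns are permuted relative to the teacher.
shifted_student = [[0.05, 0.90, 0.05],
                   [0.05, 0.05, 0.90],
                   [0.90, 0.05, 0.05],
                   [0.10, 0.80, 0.10]]

print(bilateral_contrastive(teacher_probs, teacher_probs))
print(bilateral_contrastive(teacher_probs, shifted_student))
```

In this sketch the class-wise term treats each class column as one item, so matching a column to the same teacher class is the positive pair and matching it to any other class is a negative pair. That is one plausible reading of "inter-class orthogonality and intra-class consistency"; the paper may construct its pairs differently.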

If this is right

Student models capture both per-sample and per-class teacher knowledge more effectively.
Performance gains appear consistently across different neural network architectures.
The predictive distribution receives extra geometric regularization at no added model cost.
The method applies directly to logit-based distillation pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The bilateral structure could be combined with feature-based or attention-based distillation variants.
Enforcing class-space orthogonality may reduce errors between visually similar classes in fine-grained tasks.
The same loss pattern might transfer to self-distillation or semi-supervised settings where no strong teacher exists.

Load-bearing premise

That the added bilateral contrastive loss supplies a helpful structural constraint on the probability space without creating new overfitting risks or lowering accuracy on some datasets or architectures.

What would settle it

Training a student model with BicKD and finding it matches or underperforms vanilla KD on a standard benchmark such as CIFAR-100 with a ResNet teacher-student pair.

Figures

Figures reproduced from arXiv: 2602.01265 by Hong kyu Lee, Jiangnan Zhu, Junxu Liu, Li Xiong, Yixuan Liu, Yujie Gu, Yukai Xu.

**Figure 2.** Bilateral contrast used in BicKD. The sample-wise contrast loss is defined to align the same sample/row of teacher …

**Figure 3.** Cosine similarity matrices of inter-class predictions …

**Figure 4.** Accuracy (%) of BicKD and Vanilla KD on CIFAR …
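The kind of inter-class cosine similarity matrix named in the Figure 3 caption can be computed from any batch of predictions. The sketch below is a hedged illustration, not the authors' plotting code: it assumes a samples-by-classes probability matrix stored as nested lists, and the two toy matrices are invented to show the contrast between sharp and blurry predictions.

```python
import math

def class_cosine_matrix(probs):
    """Cosine similarity between every pair of class columns of a
    (num_samples x num_classes) probability matrix."""
    columns = [list(column) for column in zip(*probs)]
    norms = [math.sqrt(sum(x * x for x in column)) for column in columns]
    size = len(columns)
    matrix = [[0.0] * size for _ in range(size)]
    for a in range(size):
        for b in range(size):
            dot = sum(x * y for x, y in zip(columns[a], columns[b]))
            matrix[a][b] = dot / (norms[a] * norms[b] + 1e-12)
    return matrix

def mean_off_diagonal(matrix):
    """Average similarity between distinct classes (lower = closer to orthogonal)."""
    size = len(matrix)
    values = [matrix[a][b] for a in range(size) for b in range(size) if a != b]
    return sum(values) / len(values)

# Toy predictions for three samples and three classes (illustrative only).
confident = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]]
blurry = [[0.50, 0.30, 0.20], [0.30, 0.50, 0.20], [0.25, 0.25, 0.50]]

print(mean_off_diagonal(class_cosine_matrix(confident)))
print(mean_off_diagonal(class_cosine_matrix(blurry)))
```

A matrix close to the identity indicates nearly orthogonal class columns; large off-diagonal entries indicate classes whose probability mass overlaps across the batch.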

Original abstract

Knowledge distillation (KD) is a machine learning framework that transfers knowledge from a teacher model to a student model. The vanilla KD proposed by Hinton et al. has been the dominant approach in logit-based distillation and demonstrates compelling performance. However, it only performs sample-wise probability alignment between teacher and student predictions, lacking a mechanism for class-wise comparison. Besides, vanilla KD imposes no structural constraint on the probability space. In this work, we propose a simple yet effective methodology, bilateral contrastive knowledge distillation (BicKD). This approach introduces a novel bilateral contrastive loss, which intensifies the orthogonality among different class generalization spaces while preserving consistency within the same class. The bilateral formulation enables explicit comparison of both sample-wise and class-wise prediction patterns between teacher and student. By emphasizing probabilistic orthogonality, BicKD further regularizes the geometric structure of the predictive distribution. Extensive experiments show that our BicKD method enhances knowledge transfer, and consistently outperforms state-of-the-art knowledge distillation techniques across various model architectures and benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BicKD adds a class-wise contrastive term to vanilla KD for better output-space geometry and reports consistent benchmark gains, but the improvements look incremental and the evaluation needs more controls.


BicKD takes the standard Hinton KD loss and layers on a bilateral contrastive term that aligns teacher and student predictions both sample-wise and class-wise. The class-wise part pushes different classes toward orthogonality in probability space while keeping same-class outputs consistent. That bilateral structure is the main novelty relative to plain logit matching and to earlier contrastive KD variants.

Referee Report

2 major / 2 minor

Summary. The paper introduces BicKD, a knowledge distillation framework that augments vanilla logit-based KD with a bilateral contrastive loss. This loss simultaneously enforces intra-class consistency (positive pairs) and inter-class orthogonality (negative pairs) on the teacher's and student's predictive distributions, providing an explicit class-wise structural constraint absent in standard KD. The authors report that the resulting method yields consistent performance gains over prior KD techniques across multiple model architectures and benchmarks.

Significance. If the empirical gains hold under rigorous controls, the bilateral contrastive term offers a lightweight, geometry-aware regularizer for the output probability simplex that could improve transfer without changing model capacity. The approach is conceptually simple and does not introduce new free parameters beyond a temperature scalar, which is a strength relative to many contrastive KD variants.

major comments (2)

[§3.2, Eq. (4)–(6)] The bilateral contrastive loss is defined with a single temperature τ shared across positive and negative terms, yet the manuscript provides no ablation on whether τ must be retuned per dataset or architecture; if retuning is required, this undermines the claim that BicKD is a drop-in improvement with no additional hyperparameter burden.
[§4.2, Table 2] The reported accuracy deltas versus vanilla KD are shown without standard deviations or results from multiple random seeds; for several entries the absolute gain is <1%, making it impossible to determine whether the bilateral term produces statistically reliable improvement or merely reflects run-to-run variance.
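The seed-variance concern can be checked with a few lines of standard-library Python. The sketch below reports mean and standard deviation per method and Welch's t statistic for the difference. The accuracy lists are synthetic placeholders invented to exercise the code; they are not results from the paper, and the names `baseline_runs` and `candidate_runs` are hypothetical.

```python
import math
import statistics

def summarize(accuracies):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    se_sq = var_a / n_a + var_b / n_b  # squared standard error of the difference
    t_stat = (mean_a - mean_b) / math.sqrt(se_sq)
    dof = se_sq ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t_stat, dof

# Synthetic placeholder accuracies (%) for five seeds each; NOT from the paper.
baseline_runs = [73.1, 73.4, 72.9, 73.3, 73.0]
candidate_runs = [73.6, 73.9, 73.2, 74.0, 73.5]

print(summarize(baseline_runs))
print(summarize(candidate_runs))
print(welch_t(candidate_runs, baseline_runs))
```

The t statistic and degrees of freedom would then be compared against a t-distribution table (or a statistics package) to obtain a p-value. With sub-1% gains, whether the difference clears that bar depends heavily on the per-seed spread.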

minor comments (2)

[Abstract] The abstract states that BicKD 'consistently outperforms' SOTA methods, but the experimental section should explicitly list the exact baselines (e.g., which versions of CRD, ReviewKD, etc.) and confirm that all methods use identical training schedules and data augmentations.
[§3] Notation for the class-wise contrastive term is introduced without a clear diagram; a small schematic showing positive/negative pair construction would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of BicKD. We address the major comments below and will incorporate revisions to enhance the manuscript's rigor.


Referee: [§3.2, Eq. (4)–(6)] The bilateral contrastive loss is defined with a single temperature τ shared across positive and negative terms, yet the manuscript provides no ablation on whether τ must be retuned per dataset or architecture; if retuning is required, this undermines the claim that BicKD is a drop-in improvement with no additional hyperparameter burden.

Authors: We appreciate this point. In the original experiments, the temperature τ was set to the same value used for the vanilla KD baseline (as detailed in the experimental setup), without any additional tuning specific to the contrastive loss. This choice was consistent across all datasets and architectures tested, supporting the drop-in applicability. To further validate this, we will add an ablation study in the revised version demonstrating the robustness of BicKD to the choice of τ, showing that performance remains stable without per-dataset retuning.

revision: yes

Referee: [§4.2, Table 2] The reported accuracy deltas versus vanilla KD are shown without standard deviations or results from multiple random seeds; for several entries the absolute gain is <1%, making it impossible to determine whether the bilateral term produces statistically reliable improvement or merely reflects run-to-run variance.

Authors: We agree that reporting standard deviations and results from multiple seeds would strengthen the statistical validity of the results. The current Table 2 reports single-run accuracies with fixed seeds for reproducibility. In the revision, we will conduct experiments with multiple random seeds and include mean accuracies along with standard deviations in Table 2 to demonstrate the reliability of the improvements.

revision: yes
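To see why the temperature question matters, it helps to look at how τ reshapes the softened distribution that a temperature-scaled loss operates on. The sketch below sweeps a few temperatures over one hypothetical logit vector and reports the entropy of the resulting distribution. The logits and the temperature grid are assumptions for illustration, and this is not the ablation the authors promise.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over one sample's logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [6.0, 2.0, 1.0, 0.0]  # hypothetical teacher logits for one sample
taus = (1.0, 2.0, 4.0, 8.0)    # illustrative temperature grid
entropies = []
for tau in taus:
    probs = softmax(logits, tau)
    entropies.append(entropy(probs))
    print(tau, [round(p, 3) for p in probs], round(entropies[-1], 4))
```

Higher temperatures flatten the distribution toward uniform, which raises the similarity between class columns before any loss is applied. This is one reason a value of τ that works for the KL term is not guaranteed to suit a similarity-based term, and why the promised ablation is worth seeing.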

Circularity Check

0 steps flagged

No significant circularity


The paper defines BicKD by introducing a bilateral contrastive loss that applies standard contrastive principles (intra-class consistency and inter-class orthogonality) to the predictive distributions of teacher and student models. No equations or derivation steps are shown that reduce the loss term, its claimed structural constraint, or the performance gains to a fitted parameter from the target data or to a self-citation chain. The central claim rests on experimental outperformance rather than any closed-form identity or redefinition of inputs as outputs. Self-citations, if present for background KD methods, are not load-bearing for the novel loss construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; standard KD assumptions such as temperature scaling are likely present but unspecified.



Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 1 internal anchor

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[2] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” CoRR abs/1804.02767, pp. 1–6, 2018.
[3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT, 2019, pp. 4171–4186.
[4] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in NIPS, 2020, pp. 1877–1901.
[5] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in ICASSP, 2016, pp. 4960–4964.
[6] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in NIPS, 2020, pp. 12 449–12 460.
[7] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop, 2015, pp. 1–9.
[8] T. Kim and S.-Y. Yun, “Revisiting orthogonality regularization: A study for convolutional neural networks in image classification,” IEEE Access, vol. 10, pp. 69 741–69 749, 2022.
[9] S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai, “Variational information distillation for knowledge transfer,” in CVPR, 2019, pp. 9163–9171.
[10] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” in ICLR, 2015, pp. 1–13.
[11] W. Park, D. Kim, Y. Lu, and M. Cho, “Relational knowledge distillation,” in CVPR, 2019, pp. 3967–3976.
[12] F. Tung and G. Mori, “Similarity-preserving knowledge distillation,” in ICCV, 2019, pp. 1365–1374.
[13] Y. Tian, D. Krishnan, and P. Isola, “Contrastive representation distillation,” in ICLR, 2020, pp. 1–19.
[14] T. Huang, S. You, F. Wang, C. Qian, and C. Xu, “Knowledge distillation from a stronger teacher,” in NIPS, 2022, pp. 33 716–33 727.
[15] Y. Jin, J. Wang, and D. Lin, “Multi-level logit distillation,” in CVPR, 2023, pp. 24 276–24 285.
[16] W. Sun, D. Chen, S. Lyu, G. Chen, C. Chen, and C. Wang, “Knowledge distillation with refined logits,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 1110–1119.
[17–18] J. Lv, H. Yang, and P. Li, “Wasserstein distance rivals kullback-leibler divergence for knowledge distillation,” ArXiv, vol. abs/2412.08139. [Online]. Available: https://api.semanticscholar.org/CorpusID:274638743
[19] K. Zheng and E.-H. Yang, “Knowledge distillation based on transformed teacher matching,” arXiv preprint arXiv:2402.11148, 2024.
[20] S. Li, T. Su, X.-Y. Zhang, and Z. Wang, “Continual learning with knowledge distillation: A survey,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21, 2024.
[21] C. Li, G. Cheng, and J. Han, “Boosting knowledge distillation via intra-class logit distribution smoothing,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 6, pp. 4190–4201, 2024.
[22] J. Rao, X. Meng, L. Ding, S. Qi, X. Liu, M. Zhang, and D. Tao, “Parameter-efficient and student-friendly knowledge distillation,” IEEE Transactions on Multimedia, vol. 26, pp. 4230–4241, 2024.
[23] S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” in ICLR, 2017, pp. 1–13.
[24] B. Peng, X. Jin, J. Liu, D. Li, Y. Wu, Y. Liu, S. Zhou, and Z. Zhang, “Correlation congruence for knowledge distillation,” in ICCV, 2019, pp. 5007–5016.
[25] Z. Chen, X. Zheng, H. Shen, Z. Zeng, Y. Zhou, and R. Zhao, “Improving knowledge distillation via category structure,” in ECCV, 2020, pp. 205–219.
[26] C. Yang, Z. An, L. Cai, and Y. Xu, “Hierarchical self-supervised augmented knowledge distillation,” in IJCAI, 2021, pp. 1217–1223.
[27] C. Yang, Z. An, L. Cai, and Y. Xu, “Knowledge distillation using hierarchical self-supervision augmented distribution,” IEEE Transactions on Neural Networks and Learning Systems, vol. 35, no. 2, pp. 2094–2108, 2024.
[28] S. Yang, L. Xu, M. Zhou, X. Yang, J. Yang, and Z. Huang, “Skill-transferring knowledge distillation method,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 11, pp. 6487–6502, 2023.
[29] L. Zhang and K. Ma, “Structured knowledge distillation for accurate and efficient object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 15 706–15 724, 2023.
[30] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” in ECCV, 2020, pp. 776–794.
[31] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” Technical Report, 2009.
[32] Y. Le and X. Yang, “Tiny imagenet visual recognition challenge,” CS 231N, vol. 7, no. 7, p. 3, 2015.
[33] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015, pp. 1–14.
[34] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
[35] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in BMVC, 2016, pp. 1–15.
[36] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in CVPR, 2017, pp. 5987–5995.
[37] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” in ECCV, 2018, pp. 116–131.
[38] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” in CVPR, 2019, pp. 2820–2828.
[39] L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi, “Meta-learning with differentiable closed-form solvers,” in ICLR, 2018, pp. 1–15.
[40] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
[41] V. Papyan, X. Han, and D. L. Donoho, “Prevalence of neural collapse during the terminal phase of deep learning training,” Proceedings of the National Academy of Sciences, vol. 117, no. 40, pp. 24 652–24 663, 2020.
