Distilling Vision Transformers for Distortion-Robust Representation Learning

Dimitrios Gunopulos; Giorgos Giannopoulos; Konstantinos Alexis

arXiv: 2604.22529 · v1 · submitted 2026-04-24 · cs.CV

Pith reviewed 2026-05-08 12:20 UTC · model grok-4.3

classification: cs.CV

keywords: vision transformers, knowledge distillation, distortion robustness, representation learning, image classification, self-supervised learning, robust features

The pith

A student Vision Transformer approximates clean-image representations from distorted inputs alone through multi-level distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a student Vision Transformer can produce representations close to those of clean images even when it only ever receives distorted versions of the same images. This happens in an asymmetric setup where a teacher model processes clean images and knowledge is transferred by aligning global embeddings, patch-level features, and attention maps. A reader would care because clean data is often scarce or unavailable, yet many real-world tasks involve distorted observations such as noisy or compressed images. The student is initialized from the same pretrained Vision Transformer as the teacher and, after this process, outperforms prior methods on image classification under distortions when using the same amount of labeled supervision.

Core claim

In an asymmetric knowledge distillation framework, both teacher and student begin from the same pretrained Vision Transformer. The teacher receives clean images while the student receives their distorted counterparts. Multi-level distillation aligns the global embeddings, the patch-level features, and the attention maps, enabling the student to approximate the representations that the teacher would produce on clean data.
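
As a rough illustration of what such an objective might look like, the sketch below combines one term per alignment level. The specific loss forms (cosine distance for the global embedding, mean squared error for patch features, KL divergence for attention rows), the unweighted sum, and all function and dictionary key names are assumptions made for illustration; this page does not state the paper's exact formulation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_squared_error(xs, ys):
    """Mean squared error over two equal-shaped lists of patch vectors."""
    total, count = 0.0, 0
    for x, y in zip(xs, ys):
        for a, b in zip(x, y):
            total += (a - b) ** 2
            count += 1
    return total / count


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, e.g. one attention row."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def multi_level_loss(teacher, student):
    """Unweighted sum of the three alignment terms (assumed loss forms)."""
    l_global = cosine_distance(teacher["global"], student["global"])
    l_patch = mean_squared_error(teacher["patches"], student["patches"])
    rows = list(zip(teacher["attention"], student["attention"]))
    l_attn = sum(kl_divergence(t, s) for t, s in rows) / len(rows)
    return l_global + l_patch + l_attn
```

In training, `teacher` would hold the outputs of the teacher on a clean image and `student` the outputs of the student on the distorted version of the same image, so the loss is zero only when all three levels match.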

What carries the argument

Asymmetric multi-level knowledge distillation that aligns global embeddings, patch features, and attention maps between a clean teacher and a distorted student.

Load-bearing premise

Aligning global embeddings, patch features, and attention maps at multiple levels transfers the clean representations without needing extra regularization or architecture changes.

What would settle it

Compute the cosine similarity between the student's embedding on a distorted image and the teacher's embedding on the matching clean image; if this similarity fails to rise above baseline levels after training, the central claim does not hold.
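
The check described above can be sketched in a few lines. The helper names are hypothetical, and reading "baseline" as the student's alignment before distillation is an assumption; the page does not define the baseline.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_alignment(student_embs, teacher_embs):
    """Mean cosine similarity over paired (distorted, clean) embeddings."""
    sims = [cosine_similarity(s, t) for s, t in zip(student_embs, teacher_embs)]
    return sum(sims) / len(sims)


def claim_supported(before, after, teacher, margin=0.0):
    """True if alignment after training beats the assumed baseline by more than margin.

    before:  student embeddings of distorted images prior to distillation
    after:   student embeddings of the same distorted images after distillation
    teacher: teacher embeddings of the matching clean images
    """
    return mean_alignment(after, teacher) > mean_alignment(before, teacher) + margin
```

A nonzero `margin` would guard against calling a negligible gain a success; what margin is appropriate is left open here.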

Figures

Figures reproduced from arXiv: 2604.22529 by Dimitrios Gunopulos, Giorgos Giannopoulos, Konstantinos Alexis.

**Figure 1.** Overview of our multi-target distillation framework. A frozen teacher ViT processes clean images while a student ViT of identical architecture receives their distorted counterparts, being trained to align with the teacher at three complementary levels: global embeddings, patch-level features, and attention maps. In this way, the student learns to recover clean semantic representations directly from corrup…

**Figure 2.** Robustness under increasing distortions. Top-1 classification accuracy (%) on the ImageNet-100 validation set as the distortion intensity exceeds the levels used during training. Models are fine-tuned on random masking (90%), Gaussian noise (σ = 0.5), and Gaussian blur (kernel size 37), respectively.

**Figure 3.** Label-efficient performance under distortions. Top-1 classification accuracy (%) on ImageNet-100 validation images for varying fractions of training labels (log-scale x-axis).

**Figure 4.** Attention maps of our distilled student models and the supervised baselines on ImageNet-100 validation images, under different distortion types. For each example, we show the original image, the distorted input, and the attention maps predicted by the supervised baseline and the encoder trained by our method.

Original abstract

Self-supervised learning has achieved remarkable success in learning visual representations from clean data, yet remains challenging when clean observations are sparse or not available at all. In this paper, we demonstrate that pretrained vision models can be leveraged to learn distortion-robust representations, which can then be effectively applied to downstream tasks operating on distorted observations. In particular, we propose an asymmetric knowledge distillation framework in which both teacher and student are initialized from the same pretrained Vision Transformer but receive different views of each image: the teacher processes clean images, while the student sees their distorted versions. We introduce multi-level distillation that aligns global embeddings, patch-level features, and attention maps and show that the student is able to approximate clean-image representations despite never directly accessing clean data. We evaluate our approach on image classification tasks across several datasets and under various distortions, consistently outperforming existing alternatives for the same amount of human supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The asymmetric clean-teacher to distorted-student distillation with multi-level alignment is a practical incremental idea, but the abstract alone leaves the robustness claim hard to verify.

The paper's main point is that you can take a pretrained ViT, let the teacher see clean images and the student see only distorted versions, then align them at global embeddings, patch features, and attention maps so that the student's representations end up close to those of clean images. The specific combination is new for distortion robustness, and the paper frames a real deployment issue well: clean data is often scarce. The approach stays simple, reuses existing pretraining, and avoids new architectures, which keeps the contribution focused and easy to implement if the results hold.

Referee Report

2 major / 1 minor

Summary. The paper proposes an asymmetric knowledge distillation framework for Vision Transformers to learn distortion-robust representations. Both teacher and student are initialized from the same pretrained ViT; the teacher receives clean images while the student receives distorted versions. Multi-level distillation aligns global embeddings, patch-level features, and attention maps, with the central claim that the student approximates clean-image representations without ever accessing clean data and consistently outperforms existing alternatives on image classification under distortions for the same level of supervision.

Significance. If the empirical results hold with proper controls, the work offers a practical way to transfer robustness from clean pretrained models to distorted inputs via distillation, which could be useful in domains where clean observations are scarce. The approach builds directly on standard supervised and distillation losses without introducing new free parameters or architectural modifications, and the multi-level alignment idea is a natural extension of prior ViT distillation techniques.

major comments (2)

[Abstract / Method] Abstract and method description: the claim that multi-level alignment of global embeddings, patch features, and attention maps is sufficient to transfer distortion robustness rests on the untested assumption that the student cannot satisfy the losses via distortion-specific compensations that match the teacher only on training distortions. No ablation or analysis is provided to rule out such shortcut solutions, which is load-bearing for the central claim that the student approximates clean representations.
[Abstract] Abstract: the statement of 'consistent outperformance' and 'several datasets and various distortions' is presented without any quantitative numbers, error bars, ablation studies, or statistical significance tests, leaving the soundness of the empirical support unverifiable from the provided text.

minor comments (1)

The manuscript would benefit from explicit statements of the exact distortion types, severity levels, and dataset splits used in the experiments to allow reproducibility.
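
To make the reproducibility point concrete, here is a minimal sketch of how two of the distortions named in the Figure 2 caption could be specified. The severity values come from that caption; everything else (patch-wise masking with zero fill, additive noise without clipping, the function and key names) is an assumption for illustration and not the authors' stated procedure. Gaussian blur is listed in the settings but not implemented here.

```python
import random

# Severity levels quoted in the Figure 2 caption; the key names are hypothetical.
TRAIN_SEVERITY = {
    "random_masking": 0.90,
    "gaussian_noise_sigma": 0.5,
    "gaussian_blur_kernel": 37,
}


def add_gaussian_noise(pixels, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each pixel."""
    return [p + rng.gauss(0.0, sigma) for p in pixels]


def mask_patches(patches, ratio, rng):
    """Zero out a fixed fraction of patches, chosen uniformly at random."""
    n_masked = round(ratio * len(patches))
    masked = set(rng.sample(range(len(patches)), n_masked))
    return [
        [0.0] * len(patch) if i in masked else list(patch)
        for i, patch in enumerate(patches)
    ]
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps the sampled distortions repeatable across runs, which is the kind of detail the referee asks the authors to report.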

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing our responses and indicating the revisions we will make to improve the paper.

Referee: [Abstract / Method] Abstract and method description: the claim that multi-level alignment of global embeddings, patch features, and attention maps is sufficient to transfer distortion robustness rests on the untested assumption that the student cannot satisfy the losses via distortion-specific compensations that match the teacher only on training distortions. No ablation or analysis is provided to rule out such shortcut solutions, which is load-bearing for the central claim that the student approximates clean representations.

Authors: We agree that ruling out shortcut solutions is important for substantiating the central claim. The multi-level distillation, especially attention map alignment, is designed to promote structural similarity in how the student processes inputs rather than allowing superficial compensations. Nevertheless, we acknowledge the lack of explicit analysis in the current version. In the revised manuscript, we will add an ablation that evaluates the student on distortion types held out from training and reports direct representation similarity (e.g., cosine similarity of embeddings) between the student on distorted inputs and the teacher on clean inputs, along with downstream task performance under these conditions. This will provide evidence that the learned representations generalize beyond training distortions. revision: yes
Referee: [Abstract] Abstract: the statement of 'consistent outperformance' and 'several datasets and various distortions' is presented without any quantitative numbers, error bars, ablation studies, or statistical significance tests, leaving the soundness of the empirical support unverifiable from the provided text.

Authors: The abstract is intentionally concise to highlight the main contributions. To improve verifiability, we will revise it to include specific quantitative highlights, such as the average accuracy improvement over the strongest baseline across the reported datasets and distortions. Full tables with error bars from multiple random seeds, ablation studies, and statistical significance tests are already included in the experimental results section and supplementary material; we will ensure clearer cross-references from the abstract and introduction. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical distillation framework is self-contained

The paper describes an empirical asymmetric knowledge distillation setup with multi-level alignment losses (global embeddings, patch features, attention maps) whose targets are supplied by an external pretrained teacher ViT on clean images. No mathematical derivation, uniqueness theorem, or first-principles prediction is claimed that reduces to fitted parameters or self-citations by construction. All components are standard supervised/distillation objectives evaluated on held-out distortions; the central claim rests on experimental outperformance rather than any tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the assumption that a pretrained ViT already encodes useful clean representations and that matching at three feature levels transfers robustness; no new entities are postulated and no parameters are fitted beyond standard training hyperparameters.

axioms (2)

domain assumption: Pretrained Vision Transformers produce high-quality representations on clean images.
Invoked when the teacher is initialized from a pretrained ViT and its outputs are treated as targets.
ad hoc to paper: Matching global embeddings, patch features, and attention maps is sufficient to transfer robustness.
Central design choice of the multi-level distillation objective.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

[1] Buades, A., Coll, B., Morel, J.M.: Non-local means denoising. Image processing on line 1, 208–212 (2011)

[2] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

[3] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International conference on machine learning. pp. 1597–1607 (2020)

[4] Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15750–15758 (2021)

[5] Dodge, S., Karam, L.: Understanding how image quality affects deep neural networks. In: 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX). pp. 1–6 (2016)

[6] Dodge, S., Karam, L.: A study and comparison of human and deep learning recognition performance under visual distortions. In: 2017 26th international conference on computer communication and networks (ICCCN). pp. 1–7. IEEE (2017)

[7] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021)

[8] Geirhos, R., Temme, C.R., Rauber, J., Schütt, H.H., Bethge, M., Wichmann, F.A.: Generalisation in humans and deep neural networks. Advances in neural information processing systems 31 (2018)

[9] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent - a new approach to self-supervised learning. Advances in neural information processing systems 33, 21271–21284 (2020)

[10] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)

[11] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

[12] Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. In: International Conference on Learning Representations (2019)

[13] Hendrycks, D., Lee, K., Mazeika, M.: Using pre-training can improve model robustness and uncertainty. In: International conference on machine learning. pp. 2712–2721. PMLR (2019)

[14] Hendrycks, D., Mazeika, M., Kadavath, S., Song, D.: Using self-supervised learning can improve model robustness and uncertainty. Advances in neural information processing systems 32 (2019)

[15] Hendrycks*, D., Mu*, N., Cubuk, E.D., Zoph, B., Gilmer, J., Lakshminarayanan, B.: Augmix: A simple method to improve robustness and uncertainty under data shift. In: International Conference on Learning Representations (2020)

[16] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

[17] Kawar, B., Elad, M., Ermon, S., Song, J.: Denoising diffusion restoration models. Advances in neural information processing systems 35, 23593–23606 (2022)

[18] Krull, A., Vičar, T., Prakash, M., Lalit, M., Jug, F.: Probabilistic noise2void: Unsupervised content-aware denoising. Frontiers in Computer Science 2, 5 (2020)

[19] Lehtinen, J., Munkberg, J., Hasselgren, J., Laine, S., Karras, T., Aittala, M., Aila, T.: Noise2noise: Learning image restoration without clean data. In: International Conference on Machine Learning. pp. 2965–2974 (2018)

[20] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition. pp. 12042–12051 (2022)

[21] Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., HAZIZA, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual feat... Transactions on Machine Learning Research (2024)

[22] Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3967–3976 (2019)

[23] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763 (2021)

[24] Ravula, S., Smyrnis, G., Jordan, M., Dimakis, A.G.: Inverse problems leveraging pre-trained contrastive representations. Advances in Neural Information Processing Systems 34, 8753–8765 (2021)

[25] Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: Hints for thin deep nets. In: 3rd International Conference on Learning Representations, ICLR 2015 (2015)

[26] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International journal of computer vision 115(3), 211–252 (2015)

[27] Schneider, S., Rusak, E., Eck, L., Bringmann, O., Brendel, W., Bethge, M.: Improving robustness against common corruptions by covariate shift adaptation. In: Advances in Neural Information Processing Systems. vol. 33, pp. 11539–11551 (2020)

[28] Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. In: International Conference on Learning Representations (2020)

[29] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International conference on machine learning. pp. 10347–10357 (2021)

[30] Xie, Q., Luong, M.T., Hovy, E., Le, Q.V.: Self-training with noisy student improves imagenet classification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10687–10698 (2020)

[31] Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In: International Conference on Learning Representations (2017)

[32] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Restormer: Efficient transformer for high-resolution image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5728–5739 (2022)

[33] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations (2018)

[34] Zhang, K., Zuo, W., Zhang, L.: Ffdnet: Toward a fast and flexible solution for cnn-based image denoising. IEEE Transactions on Image Processing 27(9), 4608–4622 (2018)
