Distilling Vision Transformers for Distortion-Robust Representation Learning
Pith reviewed 2026-05-08 12:20 UTC · model grok-4.3
The pith
A student Vision Transformer approximates clean-image representations from distorted inputs alone through multi-level distillation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In an asymmetric knowledge distillation framework, both teacher and student begin from the same pretrained Vision Transformer. The teacher receives clean images while the student receives their distorted counterparts. Multi-level distillation aligns the global embeddings, the patch-level features, and the attention maps, enabling the student to approximate the representations that the teacher would produce on clean data.
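As a rough illustration of how the three alignment terms could be combined, the sketch below sums a global-embedding term, a patch-feature term, and an attention-map term. The page does not state the paper's actual loss functions or weights, so everything here is an assumption made for illustration: cosine distance for the global embedding, mean squared error for patch features, KL divergence for attention rows, and the hypothetical names `multi_level_loss`, `w_global`, `w_patch`, and `w_attn`.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two global embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_squared_error(student_patches, teacher_patches):
    """MSE over all patch tokens and feature dimensions."""
    total, count = 0.0, 0
    for s_patch, t_patch in zip(student_patches, teacher_patches):
        for a, b in zip(s_patch, t_patch):
            total += (a - b) ** 2
            count += 1
    return total / count


def attention_kl(student_attn, teacher_attn, eps=1e-12):
    """Mean KL(teacher || student) over attention rows (each row sums to 1)."""
    total = 0.0
    for s_row, t_row in zip(student_attn, teacher_attn):
        total += sum(t * math.log((t + eps) / (s + eps))
                     for s, t in zip(s_row, t_row))
    return total / len(teacher_attn)


def multi_level_loss(student, teacher, w_global=1.0, w_patch=1.0, w_attn=1.0):
    """Weighted sum of the three alignment terms (weights are hypothetical).

    `student` holds outputs on the distorted image, `teacher` holds outputs
    on the matching clean image. Both are dicts with keys "cls", "patches",
    and "attn".
    """
    return (w_global * cosine_distance(student["cls"], teacher["cls"])
            + w_patch * mean_squared_error(student["patches"], teacher["patches"])
            + w_attn * attention_kl(student["attn"], teacher["attn"]))
```

In a real training loop the teacher outputs would be treated as fixed targets and only the student would receive gradients; this plain-Python version only shows how the terms fit together.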
What carries the argument
Asymmetric multi-level knowledge distillation that aligns global embeddings, patch features, and attention maps between a clean teacher and a distorted student.
Load-bearing premise
Aligning global embeddings, patch features, and attention maps at multiple levels transfers the clean representations without needing extra regularization or architecture changes.
What would settle it
Compute the cosine similarity between the student's embedding on a distorted image and the teacher's embedding on the matching clean image; if this similarity fails to rise above baseline levels after training, the central claim does not hold.
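The check described above could be run along the following lines. This is a minimal sketch, not the paper's evaluation code: it assumes the "baseline" is the same student before distillation (an interpretation, since the page does not define it), and the function names `mean_alignment` and `alignment_gain` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_alignment(student_embs, teacher_embs):
    """Mean cosine similarity over paired (distorted, clean) embeddings."""
    sims = [cosine_similarity(s, t) for s, t in zip(student_embs, teacher_embs)]
    return sum(sims) / len(sims)


def alignment_gain(before_embs, after_embs, teacher_embs):
    """Change in mean alignment from before to after distillation.

    before_embs: student embeddings of distorted images before training
                 (assumed baseline).
    after_embs:  student embeddings of the same images after training.
    teacher_embs: teacher embeddings of the matching clean images.
    """
    return (mean_alignment(after_embs, teacher_embs)
            - mean_alignment(before_embs, teacher_embs))
```

A gain at or below zero on held-out images would count against the central claim; a positive gain is necessary but not sufficient, since it says nothing about distortions unseen during training.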
Original abstract
Self-supervised learning has achieved remarkable success in learning visual representations from clean data, yet remains challenging when clean observations are sparse or not available at all. In this paper, we demonstrate that pretrained vision models can be leveraged to learn distortion-robust representations, which can then be effectively applied to downstream tasks operating on distorted observations. In particular, we propose an asymmetric knowledge distillation framework in which both teacher and student are initialized from the same pretrained Vision Transformer but receive different views of each image: the teacher processes clean images, while the student sees their distorted versions. We introduce multi-level distillation that aligns global embeddings, patch-level features, and attention maps and show that the student is able to approximate clean-image representations despite never directly accessing clean data. We evaluate our approach on image classification tasks across several datasets and under various distortions, consistently outperforming existing alternatives for the same amount of human supervision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an asymmetric knowledge distillation framework for Vision Transformers to learn distortion-robust representations. Both teacher and student are initialized from the same pretrained ViT; the teacher receives clean images while the student receives distorted versions. Multi-level distillation aligns global embeddings, patch-level features, and attention maps, with the central claim that the student approximates clean-image representations without ever accessing clean data and consistently outperforms existing alternatives on image classification under distortions for the same level of supervision.
Significance. If the empirical results hold with proper controls, the work offers a practical way to transfer robustness from clean pretrained models to distorted inputs via distillation, which could be useful in domains where clean observations are scarce. The approach builds directly on standard supervised and distillation losses without introducing new free parameters or architectural modifications, and the multi-level alignment idea is a natural extension of prior ViT distillation techniques.
major comments (2)
- [Abstract / Method] Abstract and method description: the claim that multi-level alignment of global embeddings, patch features, and attention maps is sufficient to transfer distortion robustness rests on the untested assumption that the student cannot satisfy the losses via distortion-specific compensations that match the teacher only on training distortions. No ablation or analysis is provided to rule out such shortcut solutions, which is load-bearing for the central claim that the student approximates clean representations.
- [Abstract] Abstract: the statement of 'consistent outperformance' and 'several datasets and various distortions' is presented without any quantitative numbers, error bars, ablation studies, or statistical significance tests, leaving the soundness of the empirical support unverifiable from the provided text.
minor comments (1)
- The manuscript would benefit from explicit statements of the exact distortion types, severity levels, and dataset splits used in the experiments to allow reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing our responses and indicating the revisions we will make to improve the paper.
- Referee: [Abstract / Method] Abstract and method description: the claim that multi-level alignment of global embeddings, patch features, and attention maps is sufficient to transfer distortion robustness rests on the untested assumption that the student cannot satisfy the losses via distortion-specific compensations that match the teacher only on training distortions. No ablation or analysis is provided to rule out such shortcut solutions, which is load-bearing for the central claim that the student approximates clean representations.
  Authors: We agree that ruling out shortcut solutions is important for substantiating the central claim. The multi-level distillation, especially attention map alignment, is designed to promote structural similarity in how the student processes inputs rather than allowing superficial compensations. Nevertheless, we acknowledge the lack of explicit analysis in the current version. In the revised manuscript, we will add an ablation that evaluates the student on distortion types held out from training and reports direct representation similarity (e.g., cosine similarity of embeddings) between the student on distorted inputs and the teacher on clean inputs, along with downstream task performance under these conditions. This will provide evidence that the learned representations generalize beyond training distortions.
  Revision: yes
- Referee: [Abstract] Abstract: the statement of 'consistent outperformance' and 'several datasets and various distortions' is presented without any quantitative numbers, error bars, ablation studies, or statistical significance tests, leaving the soundness of the empirical support unverifiable from the provided text.
  Authors: The abstract is intentionally concise to highlight the main contributions. To improve verifiability, we will revise it to include specific quantitative highlights, such as the average accuracy improvement over the strongest baseline across the reported datasets and distortions. Full tables with error bars from multiple random seeds, ablation studies, and statistical significance tests are already included in the experimental results section and supplementary material; we will ensure clearer cross-references from the abstract and introduction.
  Revision: yes
Circularity Check
No circularity: empirical distillation framework is self-contained
The paper describes an empirical asymmetric knowledge distillation setup with multi-level alignment losses (global embeddings, patch features, attention maps) whose targets are supplied by an external pretrained teacher ViT on clean images. No mathematical derivation, uniqueness theorem, or first-principles prediction is claimed that reduces to fitted parameters or self-citations by construction. All components are standard supervised/distillation objectives evaluated on held-out distortions; the central claim rests on experimental outperformance rather than any tautological reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Pretrained Vision Transformers produce high-quality representations on clean images.
- ad hoc to paper: Matching global embeddings, patch features, and attention maps is sufficient to transfer robustness.