
arxiv: 2604.22529 · v1 · submitted 2026-04-24 · 💻 cs.CV

Distilling Vision Transformers for Distortion-Robust Representation Learning

Pith reviewed 2026-05-08 12:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision transformers · knowledge distillation · distortion robustness · representation learning · image classification · self-supervised learning · robust features

The pith

A student Vision Transformer approximates clean-image representations from distorted inputs alone through multi-level distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a student Vision Transformer can produce representations close to those of clean images even when it only ever receives distorted versions of the same images. The setup is asymmetric: a teacher model processes clean images, and knowledge is transferred by aligning global embeddings, patch-level features, and attention maps. This matters because clean data is often scarce or unavailable, while many real-world tasks involve distorted observations such as noisy or compressed images. The student is initialized from the same pretrained Vision Transformer as the teacher and, after distillation, outperforms prior methods on image classification under distortions when using the same amount of labeled supervision.

Core claim

In an asymmetric knowledge distillation framework, both teacher and student begin from the same pretrained Vision Transformer. The teacher receives clean images while the student receives their distorted counterparts. Multi-level distillation aligns the global embeddings, the patch-level features, and the attention maps, enabling the student to approximate the representations that the teacher would produce on clean data.
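
The three alignment levels named above can be sketched as a single training objective. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the loss forms (cosine distance for global embeddings, mean squared error for patch features, KL divergence for attention maps), the equal default weights, and the dictionary keys `cls`, `patches`, and `attn` are all hypothetical choices, since this summary does not state them.

```python
import numpy as np


def cosine_distance(a, b, eps=1e-8):
    """Mean (1 - cosine similarity) over the batch; a and b have shape (B, D)."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a_n * b_n, axis=-1)))


def patch_mse(student_patches, teacher_patches):
    """Mean squared error between patch features of shape (B, N, D)."""
    return float(np.mean((student_patches - teacher_patches) ** 2))


def attention_kl(student_attn, teacher_attn, eps=1e-8):
    """KL(teacher || student) averaged over attention rows.

    Inputs have shape (B, H, N, N); each row along the last axis sums to 1.
    """
    t = np.clip(teacher_attn, eps, 1.0)
    s = np.clip(student_attn, eps, 1.0)
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=-1)))


def multi_level_loss(student, teacher, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three alignment terms (weights are an assumption)."""
    w_g, w_p, w_a = weights
    return (w_g * cosine_distance(student["cls"], teacher["cls"])
            + w_p * patch_mse(student["patches"], teacher["patches"])
            + w_a * attention_kl(student["attn"], teacher["attn"]))


def random_outputs(rng, batch=2, heads=3, tokens=5, dim=8):
    """Synthetic stand-in for ViT outputs; real models would supply these."""
    logits = rng.normal(size=(batch, heads, tokens, tokens))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    return {"cls": rng.normal(size=(batch, dim)),
            "patches": rng.normal(size=(batch, tokens, dim)),
            "attn": attn}


rng = np.random.default_rng(0)
teacher_out = random_outputs(rng)  # would come from the teacher on clean images
student_out = random_outputs(rng)  # would come from the student on distorted images
loss = multi_level_loss(student_out, teacher_out)
print(f"multi-level loss: {loss:.4f}")
```

In an actual training loop only the student's parameters would receive gradients from this objective, since the teacher's outputs serve as fixed targets.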

What carries the argument

Asymmetric multi-level knowledge distillation that aligns global embeddings, patch features, and attention maps between a clean teacher and a distorted student.

Load-bearing premise

Aligning global embeddings, patch features, and attention maps at multiple levels transfers the clean representations without needing extra regularization or architecture changes.

What would settle it

Compute the cosine similarity between the student's embedding on a distorted image and the teacher's embedding on the matching clean image; if this similarity fails to rise above baseline levels after training, the central claim does not hold.
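
The check described above can be sketched in a few lines. This is a hedged illustration: the text does not define "baseline levels", so the sketch assumes the baseline is the same similarity measured for the student before distillation. The function names, the `margin` parameter, and the synthetic embeddings are hypothetical and are not results from the paper.

```python
import numpy as np


def mean_cosine_similarity(student_emb, teacher_emb, eps=1e-8):
    """Average cosine similarity between matching rows of two (B, D) arrays."""
    s = student_emb / (np.linalg.norm(student_emb, axis=1, keepdims=True) + eps)
    t = teacher_emb / (np.linalg.norm(teacher_emb, axis=1, keepdims=True) + eps)
    return float(np.mean(np.sum(s * t, axis=1)))


def claim_survives(sim_before, sim_after, margin=0.0):
    """True if post-distillation similarity exceeds the baseline by more than margin."""
    return sim_after > sim_before + margin


# Synthetic stand-ins: row i of each array corresponds to the same image.
rng = np.random.default_rng(0)
teacher_clean = rng.normal(size=(16, 32))                                  # teacher, clean input
student_before = teacher_clean + rng.normal(scale=2.0, size=(16, 32))      # assumed baseline
student_after = teacher_clean + rng.normal(scale=0.2, size=(16, 32))       # after distillation

sim_before = mean_cosine_similarity(student_before, teacher_clean)
sim_after = mean_cosine_similarity(student_after, teacher_clean)
print(sim_before, sim_after, claim_survives(sim_before, sim_after))
```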

Figures

Figures reproduced from arXiv: 2604.22529 by Dimitrios Gunopulos, Giorgos Giannopoulos, Konstantinos Alexis.

Figure 1: Overview of our multi-target distillation framework. A frozen teacher ViT processes clean images while a student ViT of identical architecture receives their distorted counterparts, being trained to align with the teacher at three complementary levels: global embeddings, patch-level features, and attention maps. In this way, the student learns to recover clean semantic representations directly from corrup…

Figure 2: Robustness under increasing distortions. Top-1 classification accuracy (%) on the ImageNet-100 validation set as the distortion intensity exceeds the levels used during training. Models are fine-tuned on random masking (90%), Gaussian noise (σ = 0.5), and Gaussian blur (kernel size 37), respectively.

Figure 3: Label-efficient performance under distortions. Top-1 classification accuracy (%) on ImageNet-100 validation images for varying fractions of training labels (log-scale x-axis).

Figure 4: Attention maps of our distilled student models and the supervised baselines on ImageNet-100 validation images, under different distortion types. For each example, we show the original image, the distorted input, and the attention maps predicted by the supervised baseline and the encoder trained by our method.
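
The Figure 2 caption names three training distortions: random masking (90%), Gaussian noise (σ = 0.5), and Gaussian blur (kernel size 37). A minimal NumPy sketch of what such distortions could look like follows. Several details are assumptions because they are not given here: the masking patch size, zero-filling of masked patches, pixel values in [0, 1] with no clipping after noise, and the blur's standard deviation (set to `kernel_size / 6` purely for illustration). An odd kernel size is assumed.

```python
import numpy as np


def random_patch_mask(image, mask_ratio=0.9, patch=16, rng=None):
    """Zero out a random fraction of non-overlapping patches.

    image: (H, W, C) float array with H and W divisible by `patch`.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = image.shape
    gh, gw = h // patch, w // patch
    n_masked = int(round(mask_ratio * gh * gw))
    keep = np.ones(gh * gw, dtype=bool)
    keep[rng.choice(gh * gw, size=n_masked, replace=False)] = False
    # Expand the patch-level grid to pixel resolution.
    grid = keep.reshape(gh, gw)
    pixel_keep = np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)
    return image * pixel_keep[:, :, None]


def gaussian_noise(image, sigma=0.5, rng=None):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    rng = np.random.default_rng() if rng is None else rng
    return image + rng.normal(0.0, sigma, size=image.shape)


def gaussian_blur(image, kernel_size=37, sigma=None):
    """Separable Gaussian blur with reflect padding; output keeps the input shape."""
    sigma = kernel_size / 6.0 if sigma is None else sigma  # assumed default
    half = kernel_size // 2
    x = np.arange(-half, half + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    k /= k.sum()
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    # Filter along the width axis, then along the height axis.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)


rng = np.random.default_rng(0)
clean = rng.random((32, 32, 3))  # synthetic stand-in for a clean image
masked = random_patch_mask(clean, mask_ratio=0.9, patch=4, rng=rng)
noisy = gaussian_noise(clean, sigma=0.5, rng=rng)
blurred = gaussian_blur(clean, kernel_size=37)
print(masked.shape, noisy.shape, blurred.shape)
```

In the asymmetric setup, the teacher would receive `clean` while the student would receive one of the distorted versions.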
Original abstract

Self-supervised learning has achieved remarkable success in learning visual representations from clean data, yet remains challenging when clean observations are sparse or not available at all. In this paper, we demonstrate that pretrained vision models can be leveraged to learn distortion-robust representations, which can then be effectively applied to downstream tasks operating on distorted observations. In particular, we propose an asymmetric knowledge distillation framework in which both teacher and student are initialized from the same pretrained Vision Transformer but receive different views of each image: the teacher processes clean images, while the student sees their distorted versions. We introduce multi-level distillation that aligns global embeddings, patch-level features, and attention maps and show that the student is able to approximate clean-image representations despite never directly accessing clean data. We evaluate our approach on image classification tasks across several datasets and under various distortions, consistently outperforming existing alternatives for the same amount of human supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes an asymmetric knowledge distillation framework for Vision Transformers to learn distortion-robust representations. Both teacher and student are initialized from the same pretrained ViT; the teacher receives clean images while the student receives distorted versions. Multi-level distillation aligns global embeddings, patch-level features, and attention maps, with the central claim that the student approximates clean-image representations without ever accessing clean data and consistently outperforms existing alternatives on image classification under distortions for the same level of supervision.

Significance. If the empirical results hold with proper controls, the work offers a practical way to transfer robustness from clean pretrained models to distorted inputs via distillation, which could be useful in domains where clean observations are scarce. The approach builds directly on standard supervised and distillation losses without introducing new free parameters or architectural modifications, and the multi-level alignment idea is a natural extension of prior ViT distillation techniques.

major comments (2)
  1. [Abstract / Method] Abstract and method description: the claim that multi-level alignment of global embeddings, patch features, and attention maps is sufficient to transfer distortion robustness rests on the untested assumption that the student cannot satisfy the losses via distortion-specific compensations that match the teacher only on training distortions. No ablation or analysis is provided to rule out such shortcut solutions, which is load-bearing for the central claim that the student approximates clean representations.
  2. [Abstract] Abstract: the statement of 'consistent outperformance' and 'several datasets and various distortions' is presented without any quantitative numbers, error bars, ablation studies, or statistical significance tests, leaving the soundness of the empirical support unverifiable from the provided text.
minor comments (1)
  1. The manuscript would benefit from explicit statements of the exact distortion types, severity levels, and dataset splits used in the experiments to allow reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing our responses and indicating the revisions we will make to improve the paper.

  1. Referee: [Abstract / Method] Abstract and method description: the claim that multi-level alignment of global embeddings, patch features, and attention maps is sufficient to transfer distortion robustness rests on the untested assumption that the student cannot satisfy the losses via distortion-specific compensations that match the teacher only on training distortions. No ablation or analysis is provided to rule out such shortcut solutions, which is load-bearing for the central claim that the student approximates clean representations.

    Authors: We agree that ruling out shortcut solutions is important for substantiating the central claim. The multi-level distillation, especially attention map alignment, is designed to promote structural similarity in how the student processes inputs rather than allowing superficial compensations. Nevertheless, we acknowledge the lack of explicit analysis in the current version. In the revised manuscript, we will add an ablation that evaluates the student on distortion types held out from training and reports direct representation similarity (e.g., cosine similarity of embeddings) between the student on distorted inputs and the teacher on clean inputs, along with downstream task performance under these conditions. This will provide evidence that the learned representations generalize beyond training distortions. revision: yes

  2. Referee: [Abstract] Abstract: the statement of 'consistent outperformance' and 'several datasets and various distortions' is presented without any quantitative numbers, error bars, ablation studies, or statistical significance tests, leaving the soundness of the empirical support unverifiable from the provided text.

    Authors: The abstract is intentionally concise to highlight the main contributions. To improve verifiability, we will revise it to include specific quantitative highlights, such as the average accuracy improvement over the strongest baseline across the reported datasets and distortions. Full tables with error bars from multiple random seeds, ablation studies, and statistical significance tests are already included in the experimental results section and supplementary material; we will ensure clearer cross-references from the abstract and introduction. revision: yes
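
The error bars and significance tests mentioned in this response can be computed from per-seed accuracies along the following lines. This is a sketch using only the Python standard library. The function names are hypothetical, the paired t statistic is one common choice rather than the test the authors necessarily used, and the numbers in the demo are placeholders invented to exercise the code, not results from the paper.

```python
import math
import statistics


def summarize(accuracies):
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def paired_t_statistic(method_a, method_b):
    """t statistic of the per-seed differences a - b, paired by seed.

    Requires at least two seeds and non-identical differences
    (a zero standard deviation would divide by zero).
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Placeholder values only; index i in both lists refers to the same random seed.
placeholder_a = [2.0, 4.0, 6.0]
placeholder_b = [1.0, 2.0, 3.0]

mean_a, std_a = summarize(placeholder_a)
t_stat = paired_t_statistic(placeholder_a, placeholder_b)
print(f"{mean_a:.2f} +/- {std_a:.2f}, paired t = {t_stat:.3f}")
```

The statistic would then be compared against a t distribution with one fewer degree of freedom than the number of seeds to obtain a p-value.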

Circularity Check

0 steps flagged

No circularity: empirical distillation framework is self-contained


The paper describes an empirical asymmetric knowledge distillation setup with multi-level alignment losses (global embeddings, patch features, attention maps) whose targets are supplied by an external pretrained teacher ViT on clean images. No mathematical derivation, uniqueness theorem, or first-principles prediction is claimed that reduces to fitted parameters or self-citations by construction. All components are standard supervised/distillation objectives evaluated on held-out distortions; the central claim rests on experimental outperformance rather than any tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the assumption that a pretrained ViT already encodes useful clean representations and that matching at three feature levels transfers robustness; no new entities are postulated and no parameters are fitted beyond standard training hyperparameters.

axioms (2)
  • domain assumption: Pretrained Vision Transformers produce high-quality representations on clean images.
    Invoked when the teacher is initialized from a pretrained ViT and its outputs are treated as targets.
  • ad hoc to paper: Matching global embeddings, patch features, and attention maps is sufficient to transfer robustness.
    Central design choice of the multi-level distillation objective.

pith-pipeline@v0.9.0 · 5450 in / 1332 out tokens · 31013 ms · 2026-05-08T12:20:52.262996+00:00 · methodology

