
arxiv: 2606.07593 · v1 · pith:57KIWGXAnew · submitted 2026-05-28 · 💻 cs.CV · cs.AI

A Mechanistic Analysis of Adversarial Fine-tuning of Vision Transformers

Pith reviewed 2026-06-29 08:09 UTC · model grok-4.3

keywords adversarial fine-tuning · vision transformers · image corruptions · model robustness · attention mechanisms · sparse representations · mechanistic analysis

The pith

Adversarial fine-tuning on specific corruptions improves Vision Transformer performance only on matching corruption types and leaves sparse representations unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how adversarial fine-tuning affects Vision Transformers when trained on low- and high-frequency image corruptions. It finds that performance and model certainty rise for new examples of the same corruption types used in training. These gains fail to appear for entirely different corruption classes. Shifts appear in attention patterns and how knowledge develops layer by layer, yet the underlying sparse representations stay fundamentally the same.

Core claim

Adversarial fine-tuning on low-frequency and high-frequency image corruptions leads to improved performance and certainty on new instances of those same corruptions, but the improvements do not transfer to other corruption classes. Although visual attention and knowledge evolution change across layers, adversarial training produces no fundamental alterations to the sparse representations learned by the Vision Transformers.

What carries the argument

Mechanistic examination of attention mechanisms, internal representations, and knowledge evolution across layers to track effects of adversarial fine-tuning.
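
As a rough illustration of one quantity such an examination might track, the sketch below computes the Shannon entropy of an attention distribution and the average entropy difference between two sets of attention rows, in the spirit of the attention-entropy comparison captioned under Figure 8. The function names, the natural-log units, and the toy inputs are assumptions made for illustration; the paper's actual procedure is not described on this page.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` is a list of non-negative attention weights for a single
    query token; it is renormalized so that it sums to 1.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    probs = [w / total for w in weights]
    # Terms with p == 0 contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy_difference(attn_a, attn_b):
    """Average of entropy(row_a) - entropy(row_b) over paired rows."""
    if len(attn_a) != len(attn_b) or not attn_a:
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [attention_entropy(a) - attention_entropy(b)
             for a, b in zip(attn_a, attn_b)]
    return sum(diffs) / len(diffs)


# Toy inputs (hypothetical): diffuse attention vs. sharply peaked attention.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
difference = mean_entropy_difference([uniform], [peaked])
```

A positive `difference` here means the first set of attention rows is more diffuse than the second; comparing a base and a fine-tuned model this way is only one of several plausible readings of "changes in visual attention".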

If this is right

  • Performance and certainty improve specifically on new instances of the corruptions used during fine-tuning.
  • Robustness gains do not extend to corruption classes absent from the fine-tuning data.
  • Visual attention and knowledge evolution shift across layers after adversarial training.
  • The sparse representations inside the Vision Transformer remain unchanged in their fundamental structure.
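
One way to make "knowledge evolution across layers" concrete is to record, for each image, the earliest layer at which an intermediate prediction already matches the true class, which is what the Figure 5 caption appears to describe. The sketch below assumes the per-layer top-1 predictions have already been decoded somehow (for example, by applying the classification head to intermediate representations); that decoding step, the function names, and the toy data are assumptions, not details taken from the paper.

```python
def first_correct_layer(per_layer_predictions, true_label):
    """Index of the earliest layer whose prediction equals `true_label`.

    Returns None if no layer predicts the correct class.
    """
    for layer, predicted in enumerate(per_layer_predictions):
        if predicted == true_label:
            return layer
    return None


def mean_first_correct_layer(batch_predictions, labels):
    """Average first-correct layer over examples that are ever correct."""
    hits = [first_correct_layer(preds, label)
            for preds, label in zip(batch_predictions, labels)]
    hits = [h for h in hits if h is not None]
    return sum(hits) / len(hits) if hits else None


# Toy data (hypothetical): three images, four layers of predictions each.
batch = [[3, 7, 5, 5], [1, 1, 2, 2], [0, 4, 4, 9]]
labels = [5, 2, 8]
average_layer = mean_first_correct_layer(batch, labels)
```

A lower average after fine-tuning would suggest the model settles on the correct class earlier in the network. Examples that are never classified correctly are excluded from the average in this sketch, which is a design choice rather than something the source specifies.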

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training on a wider mix of corruption types during fine-tuning may be needed to achieve broader robustness.
  • The unchanged sparse representations suggest that ViT architectures may require different training strategies to alter their core internal processing.
  • Similar mechanistic checks could be applied to other fine-tuning methods or to multimodal models that incorporate ViTs.

Load-bearing premise

That the tracked changes in attention, representations, and layer-wise knowledge evolution are the factors that determine whether robustness transfers between corruption types.

What would settle it

Finding that fine-tuning on one set of corruptions raises accuracy on a completely different, unseen corruption class would contradict the non-transfer result.
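
A minimal sketch of how such a check could be scored: compare the accuracy gain from fine-tuning on corruptions seen during training against the gain on corruptions held out. The corruption names and accuracy values below are placeholders invented for illustration, not results from the paper, and the helper names are hypothetical.

```python
def mean_gain(base_acc, tuned_acc, names):
    """Mean accuracy change (tuned minus base) over the given corruptions."""
    names = list(names)
    if not names:
        raise ValueError("need at least one corruption name")
    return sum(tuned_acc[n] - base_acc[n] for n in names) / len(names)


def transfer_summary(base_acc, tuned_acc, seen):
    """Split the mean gain into seen and unseen corruption types."""
    seen = set(seen)
    unseen = set(base_acc) - seen
    return {
        "seen_gain": mean_gain(base_acc, tuned_acc, seen),
        "unseen_gain": mean_gain(base_acc, tuned_acc, unseen),
    }


# Placeholder accuracies (hypothetical values, not from the paper).
base = {"seen_a": 0.50, "seen_b": 0.40, "unseen_c": 0.45}
tuned = {"seen_a": 0.70, "seen_b": 0.60, "unseen_c": 0.45}
summary = transfer_summary(base, tuned, seen=["seen_a", "seen_b"])
```

Under the non-transfer result, `seen_gain` would be clearly positive while `unseen_gain` stays near zero; a clearly positive `unseen_gain` on a genuinely different corruption class is the outcome that would contradict it.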

Figures

Figures reproduced from arXiv: 2606.07593 by Dylan Hadfield-Menell (Massachusetts Institute of Technology), Hannah Gao (Massachusetts Institute of Technology), Isha Agarwal (Massachusetts Institute of Technology), Rachel Ma (Massachusetts Institute of Technology).

Figure 1: Overview of methods. ViT is adversarially trained on …
Figure 2: Top-1-accuracy, top-5-accuracy, and top-10 accuracy of …
Figure 4: Probability of predicting the correct class for base and …
Figure 5: The average first layer the model predicts the cor…
Figure 8: The average difference in attention entropy between an …
Figure 9: Distribution of cosine similarities of vanilla SAE activa…
Figure 10: Distribution of cosine similarities of BatchTopK SAE …
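
The captions for Figures 9 and 10 refer to distributions of cosine similarities over sparse autoencoder (SAE) activations. Since the captions are truncated here, exactly which activation pairs are compared is not known; the sketch below simply assumes two aligned lists of activation vectors (for instance, from two models on the same inputs) and returns one cosine similarity per pair. All names and toy vectors are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_distribution(acts_a, acts_b):
    """One cosine similarity per aligned pair of activation vectors."""
    return [cosine_similarity(a, b) for a, b in zip(acts_a, acts_b)]


# Toy sparse activation vectors (hypothetical).
acts_model_a = [[1.0, 0.0, 2.0, 0.0], [0.0, 3.0, 0.0, 0.0]]
acts_model_b = [[2.0, 0.0, 4.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
similarities = similarity_distribution(acts_model_a, acts_model_b)
```

A distribution concentrated near 1 would be consistent with representations that are largely unchanged, while mass near 0 would indicate that different features are active.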
Original abstract

The widespread use of image classification models in high-risk, real-world situations necessitates making these models robust to slight disturbances or perturbations, such as blurring or sharpening, in the input images. While vision transformers (ViTs) play an integral role in many modern-day multi-modal models like Vision-Language-Models (VLMs) and Vision-Language-Action (VLA) models, they have received a lack of attention in the setting of robustness. In this work, we analyze the effects of adversarial fine-tuning, a popular method for improving model robustness to image perturbations, on a ViT's performance on perturbed and regular images through a mechanistic lens. We adversarially train a ViT on low-frequency and high-frequency image corruptions, and attempt to explain changes in downstream model performance through an examination of the model's attention mechanisms, internal representations, and knowledge evolution. Overall, our results suggest that, while fine-tuning on inputs with common corruptions improves model performance and certainty on new instances of corrupted data, these improvements do not transfer to other classes of corruptions not seen in the training. Additionally, despite observing changes in visual attention and knowledge evolution across layers, we found that adversarial training did not lead to fundamental changes in the sparse representations learned by ViTs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines adversarial fine-tuning of Vision Transformers (ViTs) on low-frequency and high-frequency image corruptions. It reports that such fine-tuning improves model performance and certainty on new instances of the trained corruptions but that these gains do not transfer to other corruption classes. Through mechanistic analysis of attention mechanisms, internal representations, and layer-wise knowledge evolution, the authors conclude that adversarial training produces observable changes in attention and knowledge evolution yet leaves the sparse representations learned by ViTs fundamentally unchanged.

Significance. If the experimental outcomes are reproducible, the work supplies concrete observational evidence on the specificity of corruption robustness in ViTs and on the stability of their sparse representations under adversarial fine-tuning. These findings are relevant to the robustness of ViTs inside larger multi-modal architectures and could guide the design of more generalizable robustness interventions.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'adversarial training did not lead to fundamental changes in the sparse representations' is stated without any description of the sparsity metric, the quantitative threshold for 'fundamental,' or the statistical test used to establish invariance; this measurement detail is load-bearing for the mechanistic conclusion.
  2. [Abstract] Abstract: the performance and transfer claims ('improves model performance and certainty on new instances... these improvements do not transfer') are presented without reference to specific datasets, corruption parameters, accuracy deltas, or controls for model capacity and training budget, preventing assessment of whether the non-transfer result is robust.
minor comments (2)
  1. [Abstract] The abstract uses the term 'adversarially train' without clarifying whether this refers to standard adversarial training (e.g., PGD) or a corruption-specific procedure; a brief definition would improve clarity.
  2. [Abstract] The phrase 'knowledge evolution across layers' is introduced without indicating the operational definition or visualization method employed.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the abstract for greater precision and self-containment.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'adversarial training did not lead to fundamental changes in the sparse representations' is stated without any description of the sparsity metric, the quantitative threshold for 'fundamental,' or the statistical test used to establish invariance; this measurement detail is load-bearing for the mechanistic conclusion.

    Authors: The referee is correct that the abstract omits these measurement details. The main text defines the sparsity metric (activation sparsity via L0-norm on MLP outputs) and the invariance criterion (no statistically significant change via paired tests across layers). We will revise the abstract to include a concise reference to the metric and the invariance test. revision: yes

  2. Referee: [Abstract] Abstract: the performance and transfer claims ('improves model performance and certainty on new instances... these improvements do not transfer') are presented without reference to specific datasets, corruption parameters, accuracy deltas, or controls for model capacity and training budget, preventing assessment of whether the non-transfer result is robust.

    Authors: The referee correctly observes that the abstract is high-level. The full manuscript specifies the ImageNet dataset, low- and high-frequency corruptions from the Common Corruptions benchmark, accuracy deltas on seen vs. unseen corruptions, and controls for ViT size and training budget in the experimental setup. We will revise the abstract to reference the dataset family and report the key quantitative outcomes. revision: yes
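
To make the metric named in the first simulated response more concrete, the sketch below counts non-zero activations (an L0 "norm") and computes a paired t statistic over per-layer values for two models. The simulated response mentions only "paired tests" without naming one, so the choice of a paired t statistic, the tolerance parameter, and all function names and toy numbers are assumptions for illustration.

```python
import math
import statistics


def l0_norm(activations, tol=0.0):
    """Number of entries whose magnitude exceeds `tol`."""
    return sum(1 for a in activations if abs(a) > tol)


def mean_l0(batch, tol=0.0):
    """Average L0 count over a batch of activation vectors."""
    if not batch:
        raise ValueError("batch must be non-empty")
    return sum(l0_norm(row, tol) for row in batch) / len(batch)


def paired_t_statistic(x, y):
    """t statistic for paired samples x and y (e.g. one value per layer)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("paired differences have zero variance")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Toy per-layer mean L0 values for two models (hypothetical).
base_l0 = [1.0, 2.0, 3.0]
tuned_l0 = [1.0, 2.0, 4.0]
t_value = paired_t_statistic(base_l0, tuned_l0)
```

The t statistic would still need to be compared against a t distribution with n - 1 degrees of freedom to obtain a p-value; that step is omitted here to keep the sketch dependency-free.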

Circularity Check

0 steps flagged

No significant circularity detected

The paper reports observational results from controlled adversarial fine-tuning experiments on ViTs using frequency-based corruptions, measuring downstream performance, attention maps, layer-wise knowledge evolution, and representation sparsity. No equations, derivations, or first-principles claims appear; all central statements are presented as direct experimental outcomes without reduction to fitted parameters, self-definitions, or self-citation chains. The analysis is self-contained against external benchmarks and contains no load-bearing steps that collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No information available from the abstract to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5776 in / 1020 out tokens · 32327 ms · 2026-06-29T08:09:22.990278+00:00 · methodology
