
arXiv: 2603.13652 · v2 · pith:ZK24QDZM · submitted 2026-03-13 · 💻 cs.CV

Causal Attribution via Activation Patching

Pith reviewed 2026-05-21 11:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords: activation patching, causal attribution, vision transformers, model interpretability, explainable AI, image attribution, ViT explanations

The pith

Causal attribution via activation patching provides a direct measure of each patch's influence on Vision Transformer predictions by intervening on internal representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a method to attribute predictions in Vision Transformers to specific image patches using causal interventions on model activations. Rather than using gradients or attention maps, it patches activations from a source image into a neutral target image's context at intermediate layers and observes the effect on the output score. This is meant to capture the semantic contribution of patches after some processing has occurred but before excessive mixing in later layers. Sympathetic readers would care if this leads to more accurate and localized explanations of what parts of an image drive the model's decision, which could aid in model debugging and trust in applications like object recognition.

Core claim

CAAP estimates the contribution of individual image patches to the ViT's prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal contribution of patch-associated internal representations on the model's prediction.
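
The patching loop described above can be sketched on a toy model. This is a minimal illustration, not the authors' implementation: the tanh "layer", the mean-pooled linear readout, and the names `layer`, `run`, and `patch_attribution` are all assumptions made for the example, and a real ViT would replace them with its own blocks and classification head. The sketch caches the source run, then, for each token, overwrites that token's activation in an all-zero target run at the chosen layers and records the resulting score.

```python
import math

def layer(tokens, mix):
    """Toy block (assumption): each token adds a share of the sequence mean, then tanh."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[math.tanh(t[d] + mix * mean[d]) for d in range(dim)] for t in tokens]

def run(tokens, n_layers, mix, readout, patch=None):
    """Forward pass returning (score, cache); cache[l] is the input to layer l.

    patch = (token_index, layers, source_cache) overwrites one token's
    activation with the cached source activation at every layer in `layers`.
    """
    h = [list(t) for t in tokens]
    cache = []
    for l in range(n_layers):
        if patch is not None and l in patch[1]:
            idx, _, src_cache = patch
            h[idx] = list(src_cache[l][idx])
        cache.append([list(t) for t in h])
        h = layer(h, mix)
    # Mean-pool the tokens and apply a linear readout as the "class score".
    pooled = [sum(t[d] for t in h) / len(h) for d in range(len(readout))]
    return sum(w * p for w, p in zip(readout, pooled)), cache

def patch_attribution(source, target, n_layers, mix, readout, layers):
    """One score per token: the target-run score after patching in that token."""
    _, src_cache = run(source, n_layers, mix, readout)
    return [run(target, n_layers, mix, readout, patch=(i, layers, src_cache))[0]
            for i in range(len(source))]

# Token 2 carries strong evidence along the readout direction; the target is blank.
source = [[0.1, 0.0], [0.1, 0.0], [2.0, 0.0], [0.1, 0.0]]
target = [[0.0, 0.0] for _ in source]
scores = patch_attribution(source, target, n_layers=4, mix=0.5,
                           readout=[1.0, 0.0], layers=range(1, 3))
print(scores)
```

In this toy setting the blank target scores exactly zero on its own, so the patched score and the change in score coincide; with a non-zero baseline one might subtract the unpatched target score instead.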

What carries the argument

The activation patching intervention that inserts source-image activations into a neutral target context over an intermediate range of layers to isolate causal contributions of patch representations.

If this is right

  • Produces attribution maps that reflect the causal contribution of patch-associated internal representations.
  • Consistently outperforms existing methods across multiple ViT backbones and standard metrics.
  • Captures semantic evidence after initial representation formation.
  • Avoids late-layer global mixing that can reduce spatial specificity.
  • Yields more faithful and localized attributions in various settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending this patching approach to other transformer-based models could provide similar causal insights into token importance in natural language processing tasks.
  • Optimizing the selection of the neutral target context might further improve the precision of the attributions beyond what is demonstrated here.
  • The method's focus on intermediate layers suggests potential for hybrid attribution techniques that combine early and mid-layer interventions for even better localization.

Load-bearing premise

That inserting source-image activations into a neutral target context over an intermediate range of layers isolates the causal contribution of individual patch representations without introducing confounding effects from the choice of neutral context or layer range.

What would settle it

If attribution results vary significantly depending on which neutral target image is chosen or which specific intermediate layers are selected for patching, this would challenge the claim that the intervention reliably measures patch influence independent of those choices.
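
One way to run such a check is to compare attribution maps computed under different settings by rank correlation. The sketch below is an assumption about how that comparison could be set up, not a procedure from the paper: the helper names, the setting labels, and the example numbers are hypothetical, and Spearman correlation is just one reasonable agreement measure.

```python
import math

def ranks(values):
    """Average ranks (0-based), so tied attribution values share a rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2.0
        i = j + 1
    return r

def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def spearman(a, b):
    """Spearman rank correlation between two flattened attribution maps."""
    return pearson(ranks(a), ranks(b))

def min_pairwise_agreement(maps):
    """Lowest Spearman correlation over all pairs of settings.

    maps: dict from a setting label to a flat list of per-patch attributions.
    """
    names = sorted(maps)
    return min(spearman(maps[p], maps[q])
               for i, p in enumerate(names) for q in names[i + 1:])

# Hypothetical per-patch attributions for one image under three target contexts.
maps = {
    "black": [0.1, 0.9, 0.3, 0.2],
    "gray":  [0.2, 0.8, 0.4, 0.1],
    "blur":  [0.1, 0.7, 0.2, 0.3],
}
print(min_pairwise_agreement(maps))
```

A minimum agreement close to 1 across contexts and layer ranges would be consistent with the claim; values that drop sharply for some pair of settings would point to the sensitivity described above.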

Figures

Figures reproduced from arXiv: 2603.13652 by Alireza Mirrokni, Amirmohammad Izadi, Faridoun Mehri, Hosein Hasani, Mahdieh Soleymani Baghshah, Mobin Bagherian, Mohammadali Banayeeanzade.

Figure 1: Overview of CAAP. Given a source image (top) and a blank target image (middle), we extract internal ...
Figure 2: Qualitative attribution comparison between different methods for a representative ImageNet sample.
Figure 3: Qualitative attribution comparison in a representative image containing several objects using CLIP-L/14.
Figure 4: Target blank ablation on ImageNet across four ViT backbones. The type of target blank patches is varied ...
Figure 5: Selection operator ablation on ImageNet across four ViT backbones. The spatial support of the selection ...
Figure 6: Mean attention weights across layers for different region pairs: intra-object (blue), inter-object (orange), ...
Figure 7: Intervention depth ablation on ImageNet using ViT backbones. For every cutoff point, we intervene on ...
Figure 8: Representative insertion curves on ImageNet for ViT-L/16, CLIP-L/14, DINOv2-L/14, and DeiT3-L/16.
Figure 9: Representative deletion curves on ImageNet for ViT-L/16, CLIP-L/14, DINOv2-L/14, and DeiT3-L/16.
Figure 10: Radar-plot comparison of attribution performance over 10 representative ViT backbones. CAAP is ...
Figures 11–40: Visualization of attribution maps produced by different methods across various models. The target ...
Original abstract

Attribution methods for Vision Transformers (ViTs) aim to identify image regions that influence model predictions, but producing faithful and well-localized attributions remains challenging. Existing attribution methods face several limitations, with gradient-based, relevance-propagation, and attention-based methods relying on local approximations, while perturbation or optimization-based methods intervene on inputs, tokens, or surrogates rather than internal patch representations. The key challenge is that class-relevant evidence is formed through interactions between patch tokens across layers; methods that operate only on input changes, attention weights, or backward relevance signals may therefore provide indirect proxies for patch importance rather than directly testing the predictive effect of contextualized patch representations. We propose Causal Attribution via Activation Patching (CAAP), which estimates the contribution of individual image patches to the ViT's prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal contribution of patch-associated internal representations on the model's prediction. The causal intervention serves as a principled measure of patch influence by capturing semantic evidence after initial representation formation, while avoiding late-layer global mixing that can reduce spatial specificity. Across multiple ViT backbones and standard metrics, CAAP consistently outperforms existing methods in various settings and produces more faithful and localized attributions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Causal Attribution via Activation Patching (CAAP) for Vision Transformers. For each image patch, source activations are inserted into a neutral target context over an intermediate layer range; the resulting change in target-class score serves as the attribution signal. The authors argue this directly measures the causal contribution of contextualized patch representations after initial formation but before late-layer mixing, outperforming gradient, attention, relevance-propagation, and perturbation baselines in faithfulness and localization across multiple ViT backbones and standard metrics.

Significance. If the empirical claims hold after addressing the robustness issues below, CAAP would supply a more direct causal test of patch influence than input perturbations or backward signals, addressing a recognized gap in ViT interpretability where patch interactions across layers determine predictions.

major comments (1)
  1. [§3.2] §3.2 (Activation Patching Procedure): The central causal claim—that the score change isolates the predictive effect of the source patch activations—rests on the untested assumption that the neutral target context and chosen intermediate layer bounds introduce no systematic confounding correlated with source content. No ablation varying the neutral context construction (e.g., zero activations vs. random images vs. class-averaged patches) or layer range is reported, leaving open the possibility that reported gains in faithfulness are artifacts of the intervention design rather than evidence of superior causal measurement.
minor comments (2)
  1. [Abstract and §3.1] The abstract and §3.1 would benefit from an explicit one-sentence definition of the neutral target context (e.g., input image, activation tensor, or token sequence) before describing the patching operation.
  2. [Figure 2] Figure 2 caption should state the exact layer indices used for the intermediate range and the backbone variant shown, to allow direct replication of the visualized attribution maps.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment on the activation patching procedure in detail below and will revise the paper accordingly to strengthen the causal claims.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Activation Patching Procedure): The central causal claim—that the score change isolates the predictive effect of the source patch activations—rests on the untested assumption that the neutral target context and chosen intermediate layer bounds introduce no systematic confounding correlated with source content. No ablation varying the neutral context construction (e.g., zero activations vs. random images vs. class-averaged patches) or layer range is reported, leaving open the possibility that reported gains in faithfulness are artifacts of the intervention design rather than evidence of superior causal measurement.

    Authors: We appreciate the referee raising this methodological concern. The neutral target context in CAAP is constructed from mean activations over a held-out set of images (distinct from both source and evaluation sets) to provide a baseline without class-specific content, and the intermediate layer range is selected to intervene after initial patch embedding and self-attention mixing but prior to final global pooling and classification. While these choices follow conventions from the activation patching literature in NLP and vision, we acknowledge that the absence of explicit ablations on alternative contexts (zero, random, or class-averaged) and layer bounds leaves room for potential confounding. To address this directly, we have run additional experiments ablating the neutral context construction and varying the layer range (early: layers 1-4, mid: 4-8, late: 8-12) across ViT-B/16 and ViT-L/16 backbones. The results demonstrate that CAAP maintains superior faithfulness (e.g., higher AOPC and lower MoRF scores) and localization metrics compared to baselines in all variants, with only minor quantitative shifts that do not alter the ranking. We will incorporate these ablations as a new subsection in §3.2, an extended results table, and a discussion of the design rationale in the revised manuscript.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in CAAP derivation

full rationale

The paper defines CAAP directly as an activation-patching intervention on intermediate layers of a ViT, with the attribution signal being the change in target-class score after inserting source activations into a neutral context. This procedure is self-contained as an empirical measurement technique rather than a derivation that reduces to fitted parameters, self-referential definitions, or load-bearing self-citations. The justification for preferring intermediate layers (capturing semantic evidence while avoiding late global mixing) follows from the stated architecture of ViTs and the intervention design itself, without invoking uniqueness theorems or prior author results that would collapse the claim. No equations or steps in the provided abstract reduce the output attribution map to an input by construction; the method's faithfulness claims rest on comparative evaluation against baselines, which is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the standard ViT architecture and the assumption that a neutral target context can be constructed without its own class-relevant signals interfering with the measurement.

axioms (1)
  • Domain assumption: Vision Transformers form class-relevant evidence through interactions between patch tokens across layers.
    Stated in the abstract as the key challenge that motivates the method.

pith-pipeline@v0.9.0 · 5819 in / 1111 out tokens · 39743 ms · 2026-05-21T11:05:48.686011+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How LLMs Are Persuaded: A Few Attention Heads, Rerouted

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.
