Causal Attribution via Activation Patching

Alireza Mirrokni; Amirmohammad Izadi; Faridoun Mehri; Hosein Hasani; Mahdieh Soleymani Baghshah; Mobin Bagherian; Mohammadali Banayeeanzade

arxiv: 2603.13652 · v2 · pith:ZK24QDZMnew · submitted 2026-03-13 · 💻 cs.CV

Causal Attribution via Activation Patching

Amirmohammad Izadi , Mohammadali Banayeeanzade , Alireza Mirrokni , Hosein Hasani , Mobin Bagherian , Faridoun Mehri , Mahdieh Soleymani Baghshah This is my paper

Pith reviewed 2026-05-21 11:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords activation patchingcausal attributionvision transformersmodel interpretabilityexplainable AIimage attributionViT explanations

0 comments

The pith

Causal attribution via activation patching provides a direct measure of each patch's influence on Vision Transformer predictions by intervening on internal representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a method to attribute predictions in Vision Transformers to specific image patches using causal interventions on model activations. Rather than using gradients or attention maps, it patches activations from a source image into a neutral target image's context at intermediate layers and observes the effect on the output score. This is meant to capture the semantic contribution of patches after some processing has occurred but before excessive mixing in later layers. Sympathetic readers would care if this leads to more accurate and localized explanations of what parts of an image drive the model's decision, which could aid in model debugging and trust in applications like object recognition.

Core claim

CAAP estimates the contribution of individual image patches to the ViT's prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal contribution of patch-associated internal representations on the model's prediction.

What carries the argument

The activation patching intervention that inserts source-image activations into a neutral target context over an intermediate range of layers to isolate causal contributions of patch representations.

If this is right

Produces attribution maps that reflect the causal contribution of patch-associated internal representations.
Consistently outperforms existing methods across multiple ViT backbones and standard metrics.
Captures semantic evidence after initial representation formation.
Avoids late-layer global mixing that can reduce spatial specificity.
Yields more faithful and localized attributions in various settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending this patching approach to other transformer-based models could provide similar causal insights into token importance in natural language processing tasks.
Optimizing the selection of the neutral target context might further improve the precision of the attributions beyond what is demonstrated here.
The method's focus on intermediate layers suggests potential for hybrid attribution techniques that combine early and mid-layer interventions for even better localization.

Load-bearing premise

That inserting source-image activations into a neutral target context over an intermediate range of layers isolates the causal contribution of individual patch representations without introducing confounding effects from the choice of neutral context or layer range.

What would settle it

If attribution results vary significantly depending on which neutral target image is chosen or which specific intermediate layers are selected for patching, this would challenge the claim that the intervention reliably measures patch influence independent of those choices.

Figures

Figures reproduced from arXiv: 2603.13652 by Alireza Mirrokni, Amirmohammad Izadi, Faridoun Mehri, Hosein Hasani, Mahdieh Soleymani Baghshah, Mobin Bagherian, Mohammadali Banayeeanzade.

**Figure 2.** Figure 2: Qualitative attribution comparison between different methods for a representative ImageNet sample. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative attribution comparison in a representative image containing several objects using CLIP-L/14. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Target blank ablation on ImageNet across four ViT backbones. The type of target blank patches is varied [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Selection operator ablation on ImageNet across four ViT backbones. The spatial support of the selection [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Mean attention weights across layers for different region pairs: intra-object (blue), inter-object (orange), [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Intervention depth ablation on ImageNet using ViT backbones. For every cutoff point, we intervene on [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: Representative insertion curves on ImageNet for ViT-L/16, CLIP-L/14, DINOv2-L/14, and DeiT3-L/16. [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: Representative deletion curves on ImageNet for ViT-L/16, CLIP-L/14, DINOv2-L/14, and DeiT3-L/16. [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

**Figure 10.** Figure 10: Radar-plot comparison of attribution performance over 10 representative ViT backbones. CAAP is [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗

**Figure 12.** Figure 12: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗

**Figure 13.** Figure 13: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗

**Figure 14.** Figure 14: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗

**Figure 15.** Figure 15: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p021_15.png] view at source ↗

**Figure 16.** Figure 16: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p021_16.png] view at source ↗

**Figure 17.** Figure 17: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p022_17.png] view at source ↗

**Figure 18.** Figure 18: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p022_18.png] view at source ↗

**Figure 19.** Figure 19: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p022_19.png] view at source ↗

**Figure 20.** Figure 20: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p023_20.png] view at source ↗

**Figure 21.** Figure 21: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p023_21.png] view at source ↗

**Figure 22.** Figure 22: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p023_22.png] view at source ↗

**Figure 23.** Figure 23: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p024_23.png] view at source ↗

**Figure 24.** Figure 24: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p024_24.png] view at source ↗

**Figure 25.** Figure 25: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p024_25.png] view at source ↗

**Figure 26.** Figure 26: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p025_26.png] view at source ↗

**Figure 27.** Figure 27: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p025_27.png] view at source ↗

**Figure 28.** Figure 28: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p025_28.png] view at source ↗

**Figure 29.** Figure 29: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p026_29.png] view at source ↗

**Figure 30.** Figure 30: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p026_30.png] view at source ↗

**Figure 31.** Figure 31: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p026_31.png] view at source ↗

**Figure 32.** Figure 32: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p027_32.png] view at source ↗

**Figure 33.** Figure 33: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p027_33.png] view at source ↗

**Figure 34.** Figure 34: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p027_34.png] view at source ↗

**Figure 35.** Figure 35: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p028_35.png] view at source ↗

**Figure 36.** Figure 36: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p028_36.png] view at source ↗

**Figure 37.** Figure 37: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p028_37.png] view at source ↗

**Figure 38.** Figure 38: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p029_38.png] view at source ↗

**Figure 39.** Figure 39: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p029_39.png] view at source ↗

**Figure 40.** Figure 40: Visualization of attribution maps produced by different methods across various models. The target [PITH_FULL_IMAGE:figures/full_fig_p029_40.png] view at source ↗

read the original abstract

Attribution methods for Vision Transformers (ViTs) aim to identify image regions that influence model predictions, but producing faithful and well-localized attributions remains challenging. Existing attribution methods face several limitations, with gradient-based, relevance-propagation, and attention-based methods relying on local approximations, while perturbation or optimization-based methods intervene on inputs, tokens, or surrogates rather than internal patch representations. The key challenge is that class-relevant evidence is formed through interactions between patch tokens across layers; methods that operate only on input changes, attention weights, or backward relevance signals may therefore provide indirect proxies for patch importance rather than directly testing the predictive effect of contextualized patch representations. We propose Causal Attribution via Activation Patching (CAAP), which estimates the contribution of individual image patches to the ViT's prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal contribution of patch-associated internal representations on the model's prediction. The causal intervention serves as a principled measure of patch influence by capturing semantic evidence after initial representation formation, while avoiding late-layer global mixing that can reduce spatial specificity. Across multiple ViT backbones and standard metrics, CAAP consistently outperforms existing methods in various settings and produces more faithful and localized attributions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CAAP offers a direct activation intervention for patch attribution in ViTs but its advantages depend on unexamined choices for the neutral context and layer range.

read the letter

The main takeaway is that this work proposes Causal Attribution via Activation Patching, or CAAP, as a way to attribute the influence of image patches in Vision Transformers. Instead of relying on gradients, attention maps, or input changes, they directly replace activations in the model with ones from a source image while keeping a neutral target context, and they do this over an intermediate set of layers. The change in the model's output score then serves as the attribution value for that patch. This approach is new in its application to internal patch representations after some processing has occurred. Prior methods either approximate importance through backpropagation or alter the input directly, which the authors say misses the interactions that build up the evidence for a class. By intervening at the activation level in the middle layers, the method aims to capture semantic contributions without the global mixing that occurs deeper in the network. The paper does well in clearly stating the limitations of existing families of attribution techniques and in framing the intervention as a causal test. The description of how class-relevant evidence forms through patch interactions across layers is straightforward and helps motivate why a direct intervention might be preferable. Where it gets softer is in the details of the neutral target and the layer range. The claim that this isolates the causal contribution rests on the neutral context not introducing its own effects that could correlate with the source patch. Without seeing how the neutral image or activations are chosen or whether they ran checks varying the context or the starting and ending layers, it's difficult to know if the better localization and faithfulness scores are robust or tied to a particular setup. The abstract mentions consistent outperformance across backbones and metrics, but the strength of that evidence depends on those unstated choices. Readers who work on explainable AI for transformers or who build tools for understanding ViT decisions would get the most from this. It could be useful for someone looking to experiment with internal interventions rather than surface-level perturbations. The work shows enough structure and a testable idea to merit peer review. I would suggest sending it to referees, though they will likely want more on the construction of the neutral context and ablations around the layer selection to strengthen the causal interpretation.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Causal Attribution via Activation Patching (CAAP) for Vision Transformers. For each image patch, source activations are inserted into a neutral target context over an intermediate layer range; the resulting change in target-class score serves as the attribution signal. The authors argue this directly measures the causal contribution of contextualized patch representations after initial formation but before late-layer mixing, outperforming gradient, attention, relevance-propagation, and perturbation baselines in faithfulness and localization across multiple ViT backbones and standard metrics.

Significance. If the empirical claims hold after addressing the robustness issues below, CAAP would supply a more direct causal test of patch influence than input perturbations or backward signals, addressing a recognized gap in ViT interpretability where patch interactions across layers determine predictions.

major comments (1)

[§3.2] §3.2 (Activation Patching Procedure): The central causal claim—that the score change isolates the predictive effect of the source patch activations—rests on the untested assumption that the neutral target context and chosen intermediate layer bounds introduce no systematic confounding correlated with source content. No ablation varying the neutral context construction (e.g., zero activations vs. random images vs. class-averaged patches) or layer range is reported, leaving open the possibility that reported gains in faithfulness are artifacts of the intervention design rather than evidence of superior causal measurement.

minor comments (2)

[Abstract and §3.1] The abstract and §3.1 would benefit from an explicit one-sentence definition of the neutral target context (e.g., input image, activation tensor, or token sequence) before describing the patching operation.
[Figure 2] Figure 2 caption should state the exact layer indices used for the intermediate range and the backbone variant shown, to allow direct replication of the visualized attribution maps.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment on the activation patching procedure in detail below and will revise the paper accordingly to strengthen the causal claims.

read point-by-point responses

Referee: [§3.2] §3.2 (Activation Patching Procedure): The central causal claim—that the score change isolates the predictive effect of the source patch activations—rests on the untested assumption that the neutral target context and chosen intermediate layer bounds introduce no systematic confounding correlated with source content. No ablation varying the neutral context construction (e.g., zero activations vs. random images vs. class-averaged patches) or layer range is reported, leaving open the possibility that reported gains in faithfulness are artifacts of the intervention design rather than evidence of superior causal measurement.

Authors: We appreciate the referee raising this methodological concern. The neutral target context in CAAP is constructed from mean activations over a held-out set of images (distinct from both source and evaluation sets) to provide a baseline without class-specific content, and the intermediate layer range is selected to intervene after initial patch embedding and self-attention mixing but prior to final global pooling and classification. While these choices follow conventions from activation patching literature in NLP and vision, we acknowledge that the absence of explicit ablations on alternative contexts (zero, random, or class-averaged) and layer bounds leaves room for potential confounding. To address this directly, we have run additional experiments ablating the neutral context construction and varying the layer range (early: layers 1-4, mid: 4-8, late: 8-12) across ViT-B/16 and ViT-L/16 backbones. The results demonstrate that CAAP maintains superior faithfulness (e.g., higher AOPC and lower MoRF scores) and localization metrics compared to baselines in all variants, with only minor quantitative shifts that do not alter the ranking. We will incorporate these ablations as a new subsection in §3.2, an extended results table, and discussion of design rationale in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in CAAP derivation

full rationale

The paper defines CAAP directly as an activation-patching intervention on intermediate layers of a ViT, with the attribution signal being the change in target-class score after inserting source activations into a neutral context. This procedure is self-contained as an empirical measurement technique rather than a derivation that reduces to fitted parameters, self-referential definitions, or load-bearing self-citations. The justification for preferring intermediate layers (capturing semantic evidence while avoiding late global mixing) follows from the stated architecture of ViTs and the intervention design itself, without invoking uniqueness theorems or prior author results that would collapse the claim. No equations or steps in the provided abstract reduce the output attribution map to an input by construction; the method's faithfulness claims rest on comparative evaluation against baselines, which is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the standard ViT architecture and the assumption that a neutral target context can be constructed without its own class-relevant signals interfering with the measurement.

axioms (1)

domain assumption Vision Transformers form class-relevant evidence through interactions between patch tokens across layers.
Stated in the abstract as the key challenge that motivates the method.

pith-pipeline@v0.9.0 · 5819 in / 1111 out tokens · 39743 ms · 2026-05-21T11:05:48.686011+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The causal intervention serves as a principled measure of patch influence by capturing semantic evidence after initial representation formation, while avoiding late-layer global mixing

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

How LLMs Are Persuaded: A Few Attention Heads, Rerouted
cs.AI 2026-05 unverdicted novelty 7.0

Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1]

Quantifying Attention Flow in Transformers

Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197, 2020. doi: 10.18653/v1/2020.acl-main.385. URLhttps://aclanthology.org/2020.acl-main.385/

work page doi:10.18653/v1/2020.acl-main.385 2020
[2]

Grad-sam: Explaining transformers via gradient self-attention maps

Oren Barkan, Edan Hauon, Avi Caciularu, Ori Katz, Itzik Malkiel, Omri Armstrong, and Noam Koenigstein. Grad-sam: Explaining transformers via gradient self-attention maps. InProceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM), pages 2882–2887, 2021. doi: 10.1145/3459637.3482126

work page doi:10.1145/3459637.3482126 2021
[3]

What's the Point: Semantic Segmentation with Point Supervision

Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What’s the point: Semantic segmentation with point supervision, 2016. URLhttps://arxiv.org/abs/1506.02106

work page internal anchor Pith review Pith/arXiv arXiv 2016
[4]

Food-101 – mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. InEuropean Conference on Computer Vision, 2014

work page 2014
[5]

Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks

Aditya Chattopadhyay, Anirban Sarkar, Prantik Howlader, and Vineeth N. Balasubramanian. Grad-cam++: Improved visual explanations for deep convolutional networks, 2017. URL https://arxiv.org/abs/ 1710.11063

work page internal anchor Pith review Pith/arXiv arXiv 2017
[6]

Balasubramanian

Aditya Chattopadhyay, Anirban Sarkar, Prantik Howlader, and Vineeth N. Balasubramanian. Neural network attributions: A causal perspective. InProceedings of ICML, 2019. URL https://proceedings.mlr. press/v97/chattopadhyay19a.html

work page 2019
[7]

Transformer interpretability beyond attention visualization

Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 782–791, 2021

work page 2021
[8]

Atman: Understanding transformer predictions through memory efficient attention manipulation.arXiv preprint arXiv:2301.08110, 2023

Mayukh Deb, Björn Deiseroth, Samuel Weinbach, Patrick Schramowski, and Kristian Kersting. Atman: Understanding transformer predictions through memory efficient attention manipulation.arXiv preprint arXiv:2301.08110, 2023. URLhttps://arxiv.org/abs/2301.08110

work page arXiv 2023
[9]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Un- terthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021. URL https://arxiv.org/abs/2010.11929

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

Explaining through transformer input sam- pling

Alexandre Englebert, Sédrick Stassin, Géraldin Nanfack, Sidi Ahmed Mahmoudi, Xavier Siebert, Olivier Cornu, and Christophe De Vleeschouwer. Explaining through transformer input sam- pling. InProceedings of the IEEE/CVF International Conference on Computer Vision Work- shops (ICCVW), 2023. URL https://openaccess.thecvf.com/content/ICCV2023W/NIVT/html/ Engl...

work page 2023
[11]

Fong and Andrea Vedaldi

Ruth C. Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. InProceedings of ICCV, 2017. URL https://www.robots.ox.ac.uk/~vgg/publications/2017/ Fong17/

work page 2017
[12]

Fong, Mandela Patrick, and Andrea Vedaldi

Ruth C. Fong, Mandela Patrick, and Andrea Vedaldi. Understanding deep networks via extremal perturbations and smooth masks. InProceedings of ICCV, 2019

work page 2019
[13]

Large- scale unsupervised semantic segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7457–7476, 2023

Shanghua Gao, Zhong-Yu Li, Ming-Hsuan Yang, Ming-Ming Cheng, Junwei Han, and Philip Torr. Large- scale unsupervised semantic segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7457–7476, 2023. doi: 10.1109/TPAMI.2022.3218275. URL https://doi.org/10.1109/TPAMI. 2022.3218275

work page doi:10.1109/tpami.2022.3218275 2023
[14]

Zhang, Shaoqing Ren, and Jian Sun

Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2015. URL https://api.semanticscholar.org/CorpusID:206594692

work page 2016
[15]

Weinberger

Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks.2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2016. URL https://api.semanticscholar.org/CorpusID:9433631

work page 2017
[16]

Sarthak Jain and Byron C. Wallace. Attention is not explanation. InProceedings of NAACL, 2019. URL https://aclanthology.org/N19-1357/. 11

work page 2019
[17]

Attention is not only a weight: Analyzing transformers with vector norms

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. Attention is not only a weight: Analyzing transformers with vector norms. InProceedings of EMNLP, pages 7057–7075, 2020

work page 2020
[18]

Multiplex network-based rep- resentation of vision transformers for visual explainability.Neural Computing and Applications, 37 (29):24385–24420, 2025

Michele Marchetti, Davide Traini, Domenico Ursino, and Luca Virgili. Multiplex network-based rep- resentation of vision transformers for visual explainability.Neural Computing and Applications, 37 (29):24385–24420, 2025. doi: 10.1007/s00521-025-11591-x. URL https://doi.org/10.1007/ s00521-025-11591-x

work page doi:10.1007/s00521-025-11591-x 2025
[19]

How to evaluate foreground maps? InCVPR, 2014

Ran Margolin, Lihi Zelnik-Manor, and Ayellet Tal. How to evaluate foreground maps? InCVPR, 2014

work page 2014
[20]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Work- shops (CVPR W), pp

Faridoun Mehri, Mohsen Fayyaz, Mahdieh Soleymani Baghshah, and Mohammad Taher Pile- hvar. SkipPLUS: Skip the first few layers to better explain vision transformers. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 204–215, June 2024. doi: 10.1109/CVPRW63382.2024.00025. URL https://openaccess.the...

work page doi:10.1109/cvprw63382.2024.00025 2024
[21]

Libragrad: Balancing gradient flow for universally better vision transformer attributions

Faridoun Mehri, Mahdieh Soleymani Baghshah, and Mohammad Taher Pilehvar. Libragrad: Balancing gradient flow for universally better vision transformer attributions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 67–78, June 2025

work page 2025
[22]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Dissecting query-key interaction in vision transformers

Xu Pan, Aaron Philip, Ziqian Xie, and Odelia Schwartz. Dissecting query-key interaction in vision transformers. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=dIktpSgK4F

work page 2024
[24]

Parkhi, Andrea Vedaldi, Andrew Zisserman, and C

Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V . Jawahar. Cats and dogs. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012

work page 2012
[25]

Rise: Randomized input sampling for explanation of black-box models

Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. InBMVC, 2018

work page 2018
[26]

Attcat: Explaining transformers via attentive class activation tokens

Yao Qiang, Deng Pan, Chengyin Li, Xin Li, Rhongho Jang, and Dongxiao Zhu. Attcat: Explaining transformers via attentive class activation tokens. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[27]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URLhttps://arxiv.org/abs/2103.00020

work page internal anchor Pith review Pith/arXiv arXiv 2021
[28]

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge, 2015. URLhttps://arxiv.org/abs/1409.0575

work page internal anchor Pith review Pith/arXiv arXiv 2015
[29]

Anders, and Klaus-Robert Müller

Wojciech Samek, Grégoire Montavon, Sebastian Lapuschkin, Christopher J. Anders, and Klaus-Robert Müller. Explaining deep neural networks and beyond: A review of methods and applications.Proceedings of the IEEE, 109(3):247–278, 2021

work page 2021
[30]

doi:10.1007/s11263-019-01228-7

Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization.Inter- national Journal of Computer Vision, 128(2):336–359, 2019. doi: 10.1007/s11263-019-01228-7. URL https://doi.org/10.1007/s11263-019-01228-7

work page doi:10.1007/s11263-019-01228-7 2019
[31]

Not Just a Black Box: Learning Important Features Through Propagating Activation Differences

Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not just a black box: Learning important features through propagating activation differences, 2016. URL https://arxiv.org/abs/ 1605.01713

work page internal anchor Pith review Pith/arXiv arXiv 2016
[32]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URLhttps://api.semanticscholar.org/CorpusID:14124313. 12

work page internal anchor Pith review Pith/arXiv arXiv 2014
[33]

SmoothGrad: removing noise by adding noise

Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise, 2017. URLhttps://arxiv.org/abs/1706.03825

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

How to train your vit? data, augmentation, and regularization in vision transformers,

Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your vit? data, augmentation, and regularization in vision transformers, 2021. URL https://arxiv.org/abs/2106.10270

work page arXiv 2021
[35]

Axiomatic attribution for deep networks

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. InProceedings of the 34th International Conference on Machine Learning (ICML), pages 3319–3328. PMLR, 2017

work page 2017
[36]

Imagenet-hard: The hardest images remaining from a study of the power of zoom and spatial biases in image classification,

Mohammad Reza Taesiri, Giang Nguyen, Sarra Habchi, Cor-Paul Bezemer, and Anh Nguyen. Imagenet-hard: The hardest images remaining from a study of the power of zoom and spatial biases in image classification,

work page
[37]

URLhttps://arxiv.org/abs/2304.05538

work page arXiv
[38]

Deit iii: Revenge of the vit, 2022

Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit, 2022. URL https: //arxiv.org/abs/2204.07118

work page arXiv 2022
[39]

Metric-driven attributions for vision transformers

Chase Walker, Sumit Jha, and Rickard Ewetz. Metric-driven attributions for vision transformers. In International Conference on Learning Representations (ICLR), 2025. URL https://proceedings.iclr. cc/paper_files/paper/2025/file/4e21153e79aff242492146d78d09fcdb-Paper-Conference. pdf

work page 2025
[40]

Score-cam: Score-weighted visual explanations for convolutional neural networks

Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-cam: Score-weighted visual explanations for convolutional neural networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020

work page 2020
[41]

Weiyan Xie, Xiao-Hui Li, Caleb Chen Cao, and Nevin L. Zhang. Vit-cx: Causal explanation of vision transformers, 2023. URLhttps://arxiv.org/abs/2211.03064

work page arXiv 2023
[42]

Explaining information flow inside vision transformers using markov chain

Tingyi Yuan, Xuhong Li, Haoyi Xiong, Hui Cao, and Dejing Dou. Explaining information flow inside vision transformers using markov chain. InNeurIPS 2021 Workshop on eXplainable AI Approaches for Debugging and Diagnosis (XAI4Debugging), 2021. URLhttps://openreview.net/forum?id=TT-cf6QSDaQ

work page 2021
[43]

Towards best practices of activation patching in language models: Metrics and methods

Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods. InInternational Conference on Learning Representations (ICLR), 2024. URL https: //openreview.net/forum?id=Hf17y6u9BC. 13 A Broader Experiments In this section, we present additional experimental results that further expand the empirical evalua...

work page 2024

[1] [1]

Quantifying Attention Flow in Transformers

Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197, 2020. doi: 10.18653/v1/2020.acl-main.385. URLhttps://aclanthology.org/2020.acl-main.385/

work page doi:10.18653/v1/2020.acl-main.385 2020

[2] [2]

Grad-sam: Explaining transformers via gradient self-attention maps

Oren Barkan, Edan Hauon, Avi Caciularu, Ori Katz, Itzik Malkiel, Omri Armstrong, and Noam Koenigstein. Grad-sam: Explaining transformers via gradient self-attention maps. InProceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM), pages 2882–2887, 2021. doi: 10.1145/3459637.3482126

work page doi:10.1145/3459637.3482126 2021

[3] [3]

What's the Point: Semantic Segmentation with Point Supervision

Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What’s the point: Semantic segmentation with point supervision, 2016. URLhttps://arxiv.org/abs/1506.02106

work page internal anchor Pith review Pith/arXiv arXiv 2016

[4] [4]

Food-101 – mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. InEuropean Conference on Computer Vision, 2014

work page 2014

[5] [5]

Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks

Aditya Chattopadhyay, Anirban Sarkar, Prantik Howlader, and Vineeth N. Balasubramanian. Grad-cam++: Improved visual explanations for deep convolutional networks, 2017. URL https://arxiv.org/abs/ 1710.11063

work page internal anchor Pith review Pith/arXiv arXiv 2017

[6] [6]

Balasubramanian

Aditya Chattopadhyay, Anirban Sarkar, Prantik Howlader, and Vineeth N. Balasubramanian. Neural network attributions: A causal perspective. InProceedings of ICML, 2019. URL https://proceedings.mlr. press/v97/chattopadhyay19a.html

work page 2019

[7] [7]

Transformer interpretability beyond attention visualization

Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 782–791, 2021

work page 2021

[8] [8]

Atman: Understanding transformer predictions through memory efficient attention manipulation.arXiv preprint arXiv:2301.08110, 2023

Mayukh Deb, Björn Deiseroth, Samuel Weinbach, Patrick Schramowski, and Kristian Kersting. Atman: Understanding transformer predictions through memory efficient attention manipulation.arXiv preprint arXiv:2301.08110, 2023. URLhttps://arxiv.org/abs/2301.08110

work page arXiv 2023

[9] [9]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Un- terthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021. URL https://arxiv.org/abs/2010.11929

work page internal anchor Pith review Pith/arXiv arXiv 2021

[10] [10]

Explaining through transformer input sam- pling

Alexandre Englebert, Sédrick Stassin, Géraldin Nanfack, Sidi Ahmed Mahmoudi, Xavier Siebert, Olivier Cornu, and Christophe De Vleeschouwer. Explaining through transformer input sam- pling. InProceedings of the IEEE/CVF International Conference on Computer Vision Work- shops (ICCVW), 2023. URL https://openaccess.thecvf.com/content/ICCV2023W/NIVT/html/ Engl...

work page 2023

[11] [11]

Fong and Andrea Vedaldi

Ruth C. Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. InProceedings of ICCV, 2017. URL https://www.robots.ox.ac.uk/~vgg/publications/2017/ Fong17/

work page 2017

[12] [12]

Fong, Mandela Patrick, and Andrea Vedaldi

Ruth C. Fong, Mandela Patrick, and Andrea Vedaldi. Understanding deep networks via extremal perturbations and smooth masks. InProceedings of ICCV, 2019

work page 2019

[13] [13]

Large- scale unsupervised semantic segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7457–7476, 2023

Shanghua Gao, Zhong-Yu Li, Ming-Hsuan Yang, Ming-Ming Cheng, Junwei Han, and Philip Torr. Large- scale unsupervised semantic segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7457–7476, 2023. doi: 10.1109/TPAMI.2022.3218275. URL https://doi.org/10.1109/TPAMI. 2022.3218275

work page doi:10.1109/tpami.2022.3218275 2023

[14] [14]

Zhang, Shaoqing Ren, and Jian Sun

Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2015. URL https://api.semanticscholar.org/CorpusID:206594692

work page 2016

[15] [15]

Weinberger

Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks.2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2016. URL https://api.semanticscholar.org/CorpusID:9433631

work page 2017

[16] [16]

Sarthak Jain and Byron C. Wallace. Attention is not explanation. InProceedings of NAACL, 2019. URL https://aclanthology.org/N19-1357/. 11

work page 2019

[17] [17]

Attention is not only a weight: Analyzing transformers with vector norms

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. Attention is not only a weight: Analyzing transformers with vector norms. InProceedings of EMNLP, pages 7057–7075, 2020

work page 2020

[18] [18]

Multiplex network-based rep- resentation of vision transformers for visual explainability.Neural Computing and Applications, 37 (29):24385–24420, 2025

Michele Marchetti, Davide Traini, Domenico Ursino, and Luca Virgili. Multiplex network-based rep- resentation of vision transformers for visual explainability.Neural Computing and Applications, 37 (29):24385–24420, 2025. doi: 10.1007/s00521-025-11591-x. URL https://doi.org/10.1007/ s00521-025-11591-x

work page doi:10.1007/s00521-025-11591-x 2025

[19] [19]

How to evaluate foreground maps? InCVPR, 2014

Ran Margolin, Lihi Zelnik-Manor, and Ayellet Tal. How to evaluate foreground maps? InCVPR, 2014

work page 2014

[20] [20]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Work- shops (CVPR W), pp

Faridoun Mehri, Mohsen Fayyaz, Mahdieh Soleymani Baghshah, and Mohammad Taher Pile- hvar. SkipPLUS: Skip the first few layers to better explain vision transformers. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 204–215, June 2024. doi: 10.1109/CVPRW63382.2024.00025. URL https://openaccess.the...

work page doi:10.1109/cvprw63382.2024.00025 2024

[21] [21]

Libragrad: Balancing gradient flow for universally better vision transformer attributions

Faridoun Mehri, Mahdieh Soleymani Baghshah, and Mohammad Taher Pilehvar. Libragrad: Balancing gradient flow for universally better vision transformer attributions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 67–78, June 2025

work page 2025

[22] [22]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

Dissecting query-key interaction in vision transformers

Xu Pan, Aaron Philip, Ziqian Xie, and Odelia Schwartz. Dissecting query-key interaction in vision transformers. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=dIktpSgK4F

work page 2024

[24] [24]

Parkhi, Andrea Vedaldi, Andrew Zisserman, and C

Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V . Jawahar. Cats and dogs. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012

work page 2012

[25] [25]

Rise: Randomized input sampling for explanation of black-box models

Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. InBMVC, 2018

work page 2018

[26] [26]

Attcat: Explaining transformers via attentive class activation tokens

Yao Qiang, Deng Pan, Chengyin Li, Xin Li, Rhongho Jang, and Dongxiao Zhu. Attcat: Explaining transformers via attentive class activation tokens. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022

[27] [27]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URLhttps://arxiv.org/abs/2103.00020

work page internal anchor Pith review Pith/arXiv arXiv 2021

[28] [28]

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge, 2015. URLhttps://arxiv.org/abs/1409.0575

work page internal anchor Pith review Pith/arXiv arXiv 2015

[29] [29]

Anders, and Klaus-Robert Müller

Wojciech Samek, Grégoire Montavon, Sebastian Lapuschkin, Christopher J. Anders, and Klaus-Robert Müller. Explaining deep neural networks and beyond: A review of methods and applications.Proceedings of the IEEE, 109(3):247–278, 2021

work page 2021

[30] [30]

doi:10.1007/s11263-019-01228-7

Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization.Inter- national Journal of Computer Vision, 128(2):336–359, 2019. doi: 10.1007/s11263-019-01228-7. URL https://doi.org/10.1007/s11263-019-01228-7

work page doi:10.1007/s11263-019-01228-7 2019

[31] [31]

Not Just a Black Box: Learning Important Features Through Propagating Activation Differences

Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not just a black box: Learning important features through propagating activation differences, 2016. URL https://arxiv.org/abs/ 1605.01713

work page internal anchor Pith review Pith/arXiv arXiv 2016

[32] [32]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URLhttps://api.semanticscholar.org/CorpusID:14124313. 12

work page internal anchor Pith review Pith/arXiv arXiv 2014

[33] [33]

SmoothGrad: removing noise by adding noise

Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise, 2017. URLhttps://arxiv.org/abs/1706.03825

work page internal anchor Pith review Pith/arXiv arXiv 2017

[34] [34]

How to train your vit? data, augmentation, and regularization in vision transformers,

Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your vit? data, augmentation, and regularization in vision transformers, 2021. URL https://arxiv.org/abs/2106.10270

work page arXiv 2021

[35] [35]

Axiomatic attribution for deep networks

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. InProceedings of the 34th International Conference on Machine Learning (ICML), pages 3319–3328. PMLR, 2017

work page 2017

[36] [36]

Imagenet-hard: The hardest images remaining from a study of the power of zoom and spatial biases in image classification,

Mohammad Reza Taesiri, Giang Nguyen, Sarra Habchi, Cor-Paul Bezemer, and Anh Nguyen. Imagenet-hard: The hardest images remaining from a study of the power of zoom and spatial biases in image classification,

work page

[37] [37]

URLhttps://arxiv.org/abs/2304.05538

work page arXiv

[38] [38]

Deit iii: Revenge of the vit, 2022

Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit, 2022. URL https: //arxiv.org/abs/2204.07118

work page arXiv 2022

[39] [39]

Metric-driven attributions for vision transformers

Chase Walker, Sumit Jha, and Rickard Ewetz. Metric-driven attributions for vision transformers. In International Conference on Learning Representations (ICLR), 2025. URL https://proceedings.iclr. cc/paper_files/paper/2025/file/4e21153e79aff242492146d78d09fcdb-Paper-Conference. pdf

work page 2025

[40] [40]

Score-cam: Score-weighted visual explanations for convolutional neural networks

Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-cam: Score-weighted visual explanations for convolutional neural networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020

work page 2020

[41] [41]

Weiyan Xie, Xiao-Hui Li, Caleb Chen Cao, and Nevin L. Zhang. Vit-cx: Causal explanation of vision transformers, 2023. URLhttps://arxiv.org/abs/2211.03064

work page arXiv 2023

[42] [42]

Explaining information flow inside vision transformers using markov chain

Tingyi Yuan, Xuhong Li, Haoyi Xiong, Hui Cao, and Dejing Dou. Explaining information flow inside vision transformers using markov chain. InNeurIPS 2021 Workshop on eXplainable AI Approaches for Debugging and Diagnosis (XAI4Debugging), 2021. URLhttps://openreview.net/forum?id=TT-cf6QSDaQ

work page 2021

[43] [43]

Towards best practices of activation patching in language models: Metrics and methods

Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods. InInternational Conference on Learning Representations (ICLR), 2024. URL https: //openreview.net/forum?id=Hf17y6u9BC. 13 A Broader Experiments In this section, we present additional experimental results that further expand the empirical evalua...

work page 2024