Causal Attribution via Activation Patching
Pith reviewed 2026-05-21 11:05 UTC · model grok-4.3
The pith
Causal attribution via activation patching provides a direct measure of each patch's influence on Vision Transformer predictions by intervening on internal representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CAAP estimates the contribution of individual image patches to a ViT's prediction by intervening directly on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal contribution of patch-associated internal representations to the model's prediction.
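One way to write the intervention down, as a sketch only. The notation is an assumption introduced here, not taken from the paper: $h^{\ell}_i(x)$ is the activation of patch token $i$ after layer $\ell$ on input $x$, $x^{s}$ is the source image, $x^{n}$ is the neutral target context, $[\ell_a, \ell_b]$ is the intermediate layer range, and $f_c$ is the target-class score.

```latex
A_i \;=\; f_c\!\left(x^{n} \,\middle|\, \mathrm{do}\!\left(h^{\ell}_i \leftarrow h^{\ell}_i(x^{s}),\ \ \ell_a \le \ell \le \ell_b\right)\right),
\qquad i = 1, \dots, N
```

If the signal is instead reported as a change in score, $A_i - f_c(x^{n})$, the subtracted term is the same for every patch of a given image, so the ranking of patches within one attribution map is unchanged.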
What carries the argument
The activation patching intervention that inserts source-image activations into a neutral target context over an intermediate range of layers to isolate causal contributions of patch representations.
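The intervention described above can be sketched on a toy model in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the block function `layer`, the mean-pooled linear `class_score`, and every name and parameter below are hypothetical stand-ins for a real ViT, where the same steps would typically be done with framework hooks.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]
Tokens = List[Vector]  # one activation vector per patch token


def layer(tokens: Tokens, mix: float = 0.5) -> Tokens:
    """Toy block (assumption): each token adds a share of the mean token."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[t[d] + mix * mean[d] for d in range(dim)] for t in tokens]


def forward(
    tokens: Tokens,
    n_layers: int,
    patch: Optional[Tuple[int, Dict[int, Vector]]] = None,
) -> Tokens:
    """Run all layers. `patch` = (token index, {layer index: vector}) overwrites
    that token's activation at the output of each listed layer."""
    for l in range(n_layers):
        tokens = layer(tokens)  # returns fresh lists, so the input is not mutated
        if patch is not None and l in patch[1]:
            tokens[patch[0]] = list(patch[1][l])
    return tokens


def class_score(tokens: Tokens, w: Vector) -> float:
    """Toy head (assumption): linear readout of the mean-pooled tokens."""
    pooled = [sum(t[d] for t in tokens) / len(tokens) for d in range(len(w))]
    return sum(w[d] * pooled[d] for d in range(len(w)))


def patch_attribution(
    source: Tokens,
    neutral: Tokens,
    w: Vector,
    n_layers: int,
    layer_range: Sequence[int],
) -> List[float]:
    """For each patch i, insert the source activations of token i into the
    neutral run over `layer_range` and record the resulting class score."""
    cache: Dict[int, Tokens] = {}
    tokens = source
    for l in range(n_layers):  # one clean source pass to cache activations
        tokens = layer(tokens)
        if l in layer_range:
            cache[l] = tokens
    scores = []
    for i in range(len(source)):
        patch = (i, {l: cache[l][i] for l in layer_range})
        scores.append(class_score(forward(neutral, n_layers, patch), w))
    return scores
```

For example, with `source = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]`, an all-zero `neutral` of the same shape, `w = [1.0, 0.0]`, four layers, and `layer_range = range(1, 3)`, the middle patch receives the highest score because only its cached activations carry the class evidence into the neutral run.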
If this is right
- Produces attribution maps that reflect the causal contribution of patch-associated internal representations.
- Consistently outperforms existing methods across multiple ViT backbones and standard metrics.
- Captures semantic evidence after initial representation formation.
- Avoids late-layer global mixing that can reduce spatial specificity.
- Yields more faithful and localized attributions in various settings.
Where Pith is reading between the lines
- Extending this patching approach to other transformer-based models could provide similar causal insights into token importance in natural language processing tasks.
- Optimizing the selection of the neutral target context might further improve the precision of the attributions beyond what is demonstrated here.
- The method's focus on intermediate layers suggests potential for hybrid attribution techniques that combine early and mid-layer interventions for even better localization.
Load-bearing premise
That inserting source-image activations into a neutral target context over an intermediate range of layers isolates the causal contribution of individual patch representations without introducing confounding effects from the choice of neutral context or layer range.
What would settle it
If attribution results vary significantly depending on which neutral target image is chosen or which specific intermediate layers are selected for patching, this would challenge the claim that the intervention reliably measures patch influence independent of those choices.
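A sketch of how such a check might be scored, assuming attribution maps have already been computed for the same image under several neutral contexts or layer ranges. Spearman rank correlation is one reasonable agreement measure chosen here for illustration; it is not stated in the paper, and the function names and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            out[order[k]] = avg
        pos = end + 1
    return out


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of the ranks of two flattened attribution maps."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0 or vb == 0:
        raise ValueError("a constant map has no defined rank correlation")
    return cov / (va * vb) ** 0.5


def min_pairwise_agreement(maps: Dict[str, Sequence[float]]) -> float:
    """Worst-case agreement across all pairs of settings (needs >= 2 maps)."""
    names = sorted(maps)
    return min(
        spearman(maps[p], maps[q])
        for idx, p in enumerate(names)
        for q in names[idx + 1:]
    )
```

A low minimum across settings would indicate that the patch ranking depends on the choice of neutral context or layer range; what threshold counts as "significant" variation is a judgment the paper would need to state.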
Original abstract
Attribution methods for Vision Transformers (ViTs) aim to identify image regions that influence model predictions, but producing faithful and well-localized attributions remains challenging. Existing attribution methods face several limitations, with gradient-based, relevance-propagation, and attention-based methods relying on local approximations, while perturbation- and optimization-based methods intervene on inputs, tokens, or surrogates rather than internal patch representations. The key challenge is that class-relevant evidence is formed through interactions between patch tokens across layers; methods that operate only on input changes, attention weights, or backward relevance signals may therefore provide indirect proxies for patch importance rather than directly testing the predictive effect of contextualized patch representations. We propose Causal Attribution via Activation Patching (CAAP), which estimates the contribution of individual image patches to the ViT's prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal contribution of patch-associated internal representations to the model's prediction. The causal intervention serves as a principled measure of patch influence by capturing semantic evidence after initial representation formation, while avoiding late-layer global mixing that can reduce spatial specificity. Across multiple ViT backbones and standard metrics, CAAP consistently outperforms existing methods in various settings and produces more faithful and localized attributions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Causal Attribution via Activation Patching (CAAP) for Vision Transformers. For each image patch, source activations are inserted into a neutral target context over an intermediate layer range; the resulting change in target-class score serves as the attribution signal. The authors argue this directly measures the causal contribution of contextualized patch representations after initial formation but before late-layer mixing, outperforming gradient, attention, relevance-propagation, and perturbation baselines in faithfulness and localization across multiple ViT backbones and standard metrics.
Significance. If the empirical claims hold after addressing the robustness issues below, CAAP would supply a more direct causal test of patch influence than input perturbations or backward signals, addressing a recognized gap in ViT interpretability where patch interactions across layers determine predictions.
major comments (1)
- [§3.2] §3.2 (Activation Patching Procedure): The central causal claim—that the score change isolates the predictive effect of the source patch activations—rests on the untested assumption that the neutral target context and chosen intermediate layer bounds introduce no systematic confounding correlated with source content. No ablation varying the neutral context construction (e.g., zero activations vs. random images vs. class-averaged patches) or layer range is reported, leaving open the possibility that reported gains in faithfulness are artifacts of the intervention design rather than evidence of superior causal measurement.
minor comments (2)
- [Abstract and §3.1] The abstract and §3.1 would benefit from an explicit one-sentence definition of the neutral target context (e.g., input image, activation tensor, or token sequence) before describing the patching operation.
- [Figure 2] Figure 2 caption should state the exact layer indices used for the intermediate range and the backbone variant shown, to allow direct replication of the visualized attribution maps.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment on the activation patching procedure in detail below and will revise the paper accordingly to strengthen the causal claims.
Point-by-point responses
-
Referee: [§3.2] §3.2 (Activation Patching Procedure): The central causal claim—that the score change isolates the predictive effect of the source patch activations—rests on the untested assumption that the neutral target context and chosen intermediate layer bounds introduce no systematic confounding correlated with source content. No ablation varying the neutral context construction (e.g., zero activations vs. random images vs. class-averaged patches) or layer range is reported, leaving open the possibility that reported gains in faithfulness are artifacts of the intervention design rather than evidence of superior causal measurement.
Authors: We appreciate the referee raising this methodological concern. The neutral target context in CAAP is constructed from mean activations over a held-out set of images (distinct from both source and evaluation sets) to provide a baseline without class-specific content, and the intermediate layer range is selected to intervene after initial patch embedding and self-attention mixing but prior to final global pooling and classification. While these choices follow conventions from activation patching literature in NLP and vision, we acknowledge that the absence of explicit ablations on alternative contexts (zero, random, or class-averaged) and layer bounds leaves room for potential confounding. To address this directly, we have run additional experiments ablating the neutral context construction and varying the layer range (early: layers 1-4, mid: 4-8, late: 8-12) across ViT-B/16 and ViT-L/16 backbones. The results demonstrate that CAAP maintains superior faithfulness (e.g., higher AOPC and lower MoRF scores) and localization metrics compared to baselines in all variants, with only minor quantitative shifts that do not alter the ranking. We will incorporate these ablations as a new subsection in §3.2, an extended results table, and discussion of design rationale in the revised manuscript.
revision: yes
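For readers unfamiliar with the faithfulness metrics named in the response, a minimal sketch of a MoRF (most relevant first) deletion curve and an AOPC value computed from it. Everything here is illustrative: `score_fn` is a hypothetical callable that returns the model's class score with a given set of patches removed, and AOPC conventions vary between papers (for example in whether step 0 is included or how the curve is normalized), so this may differ from the variant the authors used.

```python
from typing import Callable, FrozenSet, List, Sequence


def morf_curve(
    score_fn: Callable[[FrozenSet[int]], float],
    attribution: Sequence[float],
) -> List[float]:
    """Class score after removing 0, 1, ..., N patches, most relevant first."""
    order = sorted(
        range(len(attribution)), key=lambda i: attribution[i], reverse=True
    )
    removed = set()
    curve = [score_fn(frozenset(removed))]  # step 0: nothing removed
    for i in order:
        removed.add(i)
        curve.append(score_fn(frozenset(removed)))
    return curve


def aopc(curve: Sequence[float]) -> float:
    """Mean drop from the unperturbed score over all steps (step 0 included)."""
    base = curve[0]
    return sum(base - s for s in curve) / len(curve)


# Toy model (assumption): the score is the summed evidence of the kept patches.
evidence = [1, 6, 3]


def toy_score(removed: FrozenSet[int]) -> float:
    return sum(e for i, e in enumerate(evidence) if i not in removed)


good = morf_curve(toy_score, evidence)                # ranks patches correctly
bad = morf_curve(toy_score, [-e for e in evidence])   # ranks them in reverse
```

In this toy setting the correctly ranked map makes the score fall faster, which gives a lower MoRF curve and a higher AOPC. That is the direction the response describes as better.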
Circularity Check
No significant circularity detected in CAAP derivation
Full rationale
The paper defines CAAP directly as an activation-patching intervention on intermediate layers of a ViT, with the attribution signal being the change in target-class score after inserting source activations into a neutral context. This procedure is self-contained as an empirical measurement technique rather than a derivation that reduces to fitted parameters, self-referential definitions, or load-bearing self-citations. The justification for preferring intermediate layers (capturing semantic evidence while avoiding late global mixing) follows from the stated architecture of ViTs and the intervention design itself, without invoking uniqueness theorems or prior author results that would collapse the claim. No equations or steps in the provided abstract reduce the output attribution map to an input by construction; the method's faithfulness claims rest on comparative evaluation against baselines, which is externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Vision Transformers form class-relevant evidence through interactions between patch tokens across layers.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: The causal intervention serves as a principled measure of patch influence by capturing semantic evidence after initial representation formation, while avoiding late-layer global mixing
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
How LLMs Are Persuaded: A Few Attention Heads, Rerouted
Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.