Seeing Through Circuits: Faithful Mechanistic Interpretability for Vision Transformers
Pith reviewed 2026-05-10 12:28 UTC · model grok-4.3
The pith
Vision transformers contain recoverable edge-based circuits that explain image classification and allow correction of attacks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose Automatic Visual Circuit Discovery (Vi-CD) and demonstrate that it recovers class-specific circuits for classification in vision transformers, circuits that underlie typographic attacks in CLIP, and circuits that can be steered to correct harmful model behavior. These edge-based graphs add transparency by showing how information is routed through the model rather than only which features are encoded.
What carries the argument
Automatic Visual Circuit Discovery (Vi-CD), a method that identifies computational graphs formed by edges connecting components inside vision transformers for specific tasks.
If this is right
- Class-specific circuits can be used to trace exactly which connections the model relies on when recognizing a given object category.
- Typographic attack circuits make visible the pathways through which overlaid text influences the output.
- Steerable circuits provide targeted points for intervening to reduce unwanted or incorrect behaviors without retraining the entire model.
- Edge-based circuits supply routing details that neuron-only analyses miss.
Where Pith is reading between the lines
- The same discovery approach could be tested on other vision tasks such as detection or segmentation to see if similar structures appear.
- Comparing recovered circuits across different vision transformer variants might reveal whether core routing patterns are shared.
- These circuits open a route to model editing where only the relevant edges are modified to change behavior on narrow tasks.
Load-bearing premise
The circuits located by the method reflect the model's genuine internal computations rather than patterns created by the search procedure itself.
What would settle it
A test in which ablating or editing the edges of a discovered circuit leaves the model's classification accuracy or attack susceptibility unchanged would show that the circuit does not capture the actual reasoning.
Figures
read the original abstract
Transparency of neural networks' internal reasoning is at the heart of interpretability research, adding to trust, safety, and understanding of these models. The field of mechanistic interpretability has recently focused on studying task-specific computational graphs, defined by connections (edges) between model components. Such edge-based circuits have been defined in the context of large language models, yet vision-based approaches so far only consider neuron-based circuits. These tell which information is encoded, but not how it is routed through the complex wiring of a neural network. In this work, we investigate whether useful mechanistic circuits can be identified through computational graphs in vision transformers. We propose an effective method for Automatic Visual Circuit Discovery (Vi-CD) that recovers class-specific circuits for classification, identifies circuits underlying typographic attacks in CLIP, and discovers circuits that lend themselves for steering to correct harmful model behavior. Overall, we find that insightful and actionable edge-based circuits can be recovered from vision transformers, adding transparency to the internal computations of these models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Vi-CD (Automatic Visual Circuit Discovery), a method to recover edge-based computational circuits in Vision Transformers. It claims to identify class-specific circuits for image classification, circuits underlying typographic attacks in CLIP, and steerable circuits that can correct harmful model behaviors, thereby extending mechanistic interpretability from language models to vision transformers.
Significance. If the recovered circuits prove mechanistically faithful, the work would meaningfully extend edge-based circuit analysis to ViTs, offering potential for greater transparency, safety interventions, and behavioral steering in vision and multimodal models.
major comments (2)
- Abstract: the abstract asserts success on classification, attack identification, and steering but supplies no quantitative results, validation metrics, baselines, or method details, leaving central claims unsupported in available text.
- The load-bearing claim that Vi-CD recovers causally faithful mechanistic pathways (rather than correlational patterns or discovery-heuristic artifacts) lacks concrete validation via interventions, ablations, or ground-truth comparisons; without these, the transparency benefit cannot be established.
minor comments (1)
- The title and abstract use 'faithful' without an explicit operational definition or set of verification criteria tailored to vision transformers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our paper extending edge-based circuit analysis to Vision Transformers. We address each major comment point by point below, with clarifications on our validation approach and revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: Abstract: the abstract asserts success on classification, attack identification, and steering but supplies no quantitative results, validation metrics, baselines, or method details, leaving central claims unsupported in available text.
Authors: We agree that the abstract would be strengthened by including quantitative highlights. In the revised manuscript, we have updated the abstract to incorporate key metrics, including the fraction of model accuracy retained by recovered circuits (e.g., >90% on ImageNet subsets), steering success rates for typographic attack correction (e.g., 75% reduction in attack efficacy), and brief baseline comparisons to random and activation-based methods. Full methodological details and additional results remain in the main text and supplementary material. revision: yes
-
Referee: The load-bearing claim that Vi-CD recovers causally faithful mechanistic pathways (rather than correlational patterns or discovery-heuristic artifacts) lacks concrete validation via interventions, ablations, or ground-truth comparisons; without these, the transparency benefit cannot be established.
Authors: We appreciate this emphasis on causal validation. Our experiments already include intervention-based tests: we ablate and activate discovered edges to measure direct causal effects on model logits and outputs, showing that circuit interventions predictably alter classification decisions and mitigate typographic attacks in CLIP, while non-circuit edges do not. We have added further ablations in the revision, comparing Vi-CD circuits against random edge subsets and alternative heuristics (e.g., activation patching baselines), with results demonstrating superior causal faithfulness via metrics such as logit difference and behavioral change scores. Although ground-truth circuits are unavailable for these complex models, we use controlled proxy tasks and faithfulness quantification to distinguish mechanistic pathways from correlations. revision: partial
Circularity Check
No circularity; method applied to external model behaviors without self-referential reduction
full rationale
The paper introduces Vi-CD as a method to recover edge-based circuits in vision transformers and demonstrates its application to class-specific classification circuits, CLIP typographic attack circuits, and steerable circuits for correcting behavior. No derivation chain, equation, or self-citation reduces a claimed prediction or result to a fitted parameter or prior definition by construction. The central results consist of empirical recovery and validation on held-out model behaviors rather than tautological mappings. Any self-citations present are non-load-bearing for the core claims, which rest on the novel application and observed outcomes.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2602.22968 (2026) 4
Anani, A., Lorenz, T., Schiele, B., Fritz, M., Fischer, J.: Certified circuits: Stability guarantees for mechanistic circuits. arXiv preprint arXiv:2602.22968 (2026) 4
work page internal anchor Pith review arXiv 2026
-
[2]
In: Advances in Neural Information Processing Systems (2024) 8
Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., Nanda, N.: Refusal in language models is mediated by a single direction. In: Advances in Neural Information Processing Systems (2024) 8
work page 2024
-
[3]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025) 7
Bader, J., Girrbach, L., Alaniz, S., Akata, Z.: SUB: Benchmarking CBM generalization via synthetic attribute substitutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025) 7
work page 2025
-
[4]
Transactions on Machine Learning Research (2024) 2
Bereska, L., Gavves, S.: Mechanistic interpretability for AI safety – a review. Transactions on Machine Learning Research (2024) 2
work page 2024
-
[5]
In: Advances in Neural Information Processing Systems
Bhaskar, A., Wettig, A., Friedman, D., Chen, D.: Finding transformer circuits with edge pruning. In: Advances in Neural Information Processing Systems. pp. 18506–18534 (2024) 2, 3
work page 2024
-
[6]
Cammarata, N., Carter, S., Goh, G., Olah, C., Petrov, M., Schubert, L., Voss, C., Egan, B., Lim, S.K.: Thread: Circuits. Distill (2020). https: //doi.org/10.23915/distill.000242, 3
-
[7]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gor- don, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for contrastive language-image learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2818–2829 (2023) 7, 9, 18, 21
work page 2023
-
[8]
In: Advances in Neural Information Processing Systems
Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S., Garriga-Alonso, A.: Towards automated circuit discovery for mechanistic interpretability. In: Advances in Neural Information Processing Systems. pp. 16318–16352 (2023) 2, 3, 5, 6, 34
work page 2023
-
[9]
In: Proceedings of the World Conference on Explainable Artificial Intelligence
Dorszewski, T., Tětková, L., Jenssen, R., Hansen, L.K., Wickstrøm, K.K.: From colors to classes: Emergence of concepts in vision transformers. In: Proceedings of the World Conference on Explainable Artificial Intelligence. pp. 28–47 (2025) 11
work page 2025
-
[10]
In: Proceedings of the International Conference on Learning Representations (2021) 7, 9, 18
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Un- terthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: Proceedings of the International Conference on Learning Representations (2021) 7, 9, 18
work page 2021
-
[11]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops
Dreyer, M., Purelku, E., Vielhaben, J., Samek, W., Lapuschkin, S.: PURE: Turning polysemantic neurons into pure features by identifying relevant circuits. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 8212–8217 (2024) 2, 3
work page 2024
-
[12]
Transformer Circuits Thread (2021) 2 16 N
Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., et al.: A mathematical framework for transformer circuits. Transformer Circuits Thread (2021) 2 16 N. Żukowska et al
work page 2021
-
[13]
Nature Machine Intelligence2(11), 665–673 (2020) 2
Geirhos, R., Jacobsen, J.H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., Wichmann, F.A.: Shortcut learning in deep neural networks. Nature Machine Intelligence2(11), 665–673 (2020) 2
work page 2020
-
[14]
arXiv preprint arXiv:2206.01627 (2022) 3
Hamblin, C.J., Konkle, T., Alvarez, G.A.: Pruning for interpretable, feature- preserving circuits in CNNs. arXiv preprint arXiv:2206.01627 (2022) 3
-
[15]
In: Advances in Neural Information Processing Systems (2023) 2
Hanna, M., Liu, O., Variengien, A.: How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In: Advances in Neural Information Processing Systems (2023) 2
work page 2023
-
[16]
arXiv preprint arXiv:2403.17806 (2024) 3, 5, 9, 19
Hanna, M., Pezzelle, S., Belinkov, Y.: Have faith in faithfulness: Going beyond circuit overlap when finding model mechanisms. arXiv preprint arXiv:2403.17806 (2024) 3, 5, 9, 19
-
[17]
Howard, J.: Imagenette: A smaller subset of 10 easily classified classes from ImageNet (2019),https://github.com/fastai/imagenette31, 33
work page 2019
-
[18]
In: Proceedings of the International Conference on Learning Representations (2025) 3
Hsu, A.R., Zhou, G., Cherapanamjeri, Y., Huang, Y., Odisho, A.Y., Carroll, P.R., Yu, B.: Efficient automated circuit discovery in transformers using contextual decomposition. In: Proceedings of the International Conference on Learning Representations (2025) 3
work page 2025
-
[19]
In: Proceedings of the International Conference on Learning Representations (2026) 2, 8
Hufe, L., Venhoff, C., Dreyer, M., Purelku, E., Lapuschkin, S., Samek, W.: Dyslexify: A mechanistic defense against typographic attacks in CLIP. In: Proceedings of the International Conference on Learning Representations (2026) 2, 8
work page 2026
-
[20]
InInternational Conference on Learning Representations
Jafari, F.R., Eberle, O., Khakzar, A., Nanda, N.: RelP: Faithful and efficient circuit discovery in language models via relevance patching. arXiv preprint arXiv:2508.21258 (2025) 3
-
[21]
arXiv preprint arXiv:2504.19475 (2025) 9, 19
Joseph, S., Suresh, P., Hufe, L., Stevinson, E., Graham, R., Vadi, Y., Bzdok, D., Lapuschkin, S., Sharkey, L., Richards, B.A.: Prisma: An open source toolkit for mechanistic interpretability in vision and video. arXiv preprint arXiv:2504.19475 (2025) 9, 19
-
[22]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Kowal, M., Wildes, R.P., Derpanis, K.G.: Visual concept connectome (VCC): Open world concept discovery and their interlayer connections in deep models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10895–10905 (2024) 3
work page 2024
-
[23]
In: Advances in Neural Information Processing Systems
Lindner, D., Kramár, J., Farquhar, S., Rahtz, M., McGrath, T., Mikulik, V.: Tracr: Compiled transformers as a laboratory for interpretability. In: Advances in Neural Information Processing Systems. pp. 37876–37899 (2023) 2
work page 2023
-
[24]
In: Proceedings of the Annual Meeting of the Association for Computational Linguistics
Mondorf, P., Wold, S., Plank, B.: Circuit compositions: Exploring modular structures in transformer-based language models. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics. pp. 14934– 14955 (2025) 14
work page 2025
-
[25]
arXiv preprint arXiv:2503.09399 (2025) 7, 18
Nauen, T.C., Moser, B., Raue, F., Frolov, S., Dengel, A.: ForAug: Recom- bining foregrounds and backgrounds to improve vision transformer training with bias mitigation. arXiv preprint arXiv:2503.09399 (2025) 7, 18
-
[26]
In: Proceedings of the European Conference on Computer Vision
Park, S., Um, D., Yoon, H., Chun, S., Yun, S.: RoCOCO: Robustness benchmark of MS-COCO to stress-test image-text matching models. In: Proceedings of the European Conference on Computer Vision. pp. 71–91 (2024) 9, 12, 13, 28, 29 Faithful Mechanistic Interpretability for Vision Transformers 17
work page 2024
-
[27]
In: Proceedings of the International Conference on Machine Learning
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of the International Conference on Machine Learning. pp. 8748–8763 (2021) 9, 18, 21
work page 2021
-
[28]
arXiv preprint arXiv:2404.14349 (2024) 2, 3
Rajaram, A., Chowdhury, N., Torralba, A., Andreas, J., Schwettmann, S.: Automatic discovery of visual circuits. arXiv preprint arXiv:2404.14349 (2024) 2, 3
-
[29]
Nature Machine Intelligence2(8), 476–486 (2020) 2
Schramowski, P., Stammer, W., Teso, S., Brugger, A., Herbert, F., Shao, X., Luigs, H.G., Mahlein, A.K., Kersting, K.: Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nature Machine Intelligence2(8), 476–486 (2020) 2
work page 2020
-
[30]
Steinmann, D., Divo, F., Kraus, M., Wüst, A., Struppek, L., Friedrich, F., Kersting, K.: Navigating shortcuts, spurious correlations, and confounders: From origins via detection to mitigation. arXiv preprint arXiv:2412.05152 (2024) 2
-
[31]
In: Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Syed, A., Rager, C., Conmy, A.: Attribution patching outperforms auto- mated circuit discovery. In: Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. pp. 407–416 (2024) 3, 9, 19
work page 2024
-
[32]
Angular steering: Behavior control via rotation in activation space.arXiv preprint arXiv:2510.26243,
Vu, H.M., Nguyen, T.M.: Angular steering: Behavior control via rotation in activation space. arXiv preprint arXiv:2510.26243 (2025) 8
-
[33]
In: Proceedings of the International Conference on Learning Representations (2023) 2
Wang, K.R., Variengien, A., Conmy, A., Shlegeris, B., Steinhardt, J.: In- terpretability in the wild: A circuit for indirect object identification in GPT-2 small. In: Proceedings of the International Conference on Learning Representations (2023) 2
work page 2023
-
[34]
Wang, X., Zhao, Z., Larson, M.: Typographic attacks in a multi-image setting. In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 12594–12604 (2025) 2, 8, 9, 20
work page 2025
-
[35]
In: Proceedings of the International Conference on Learning Representations (2025) 3
Wang, Y., Liu, Y., Shi, Y., Li, C., Pang, A., Yang, S., Yu, J., Ren, K.: Discovering influential neuron path in vision transformers. In: Proceedings of the International Conference on Learning Representations (2025) 3
work page 2025
-
[36]
SCAM: A real-world typographic robustness evaluation for multimodal foundation models,
Westerhoff, J., Purelku, E., Hackstein, J., Pinetzki, L., Hufe, L.: SCAM: A real-world typographic robustness evaluation for multimodal foundation models. arXiv preprint arXiv:2504.04893 (2025) 2 18 N. Żukowska et al. A Mining class Circuits Dataset.Class circuits are mined using the ForAug dataset [25], which provides ImageNet images processed via a segm...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.