pith. machine review for the scientific record.

arxiv: 2309.16042 · v2 · submitted 2023-09-27 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 11:52 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords activation patching · mechanistic interpretability · language models · localization · circuit discovery · evaluation metrics · corruption methods · best practices

The pith

Varying metrics and corruption methods in activation patching can produce conflicting pictures of which model components matter.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how small changes in the way activation patching is performed affect the results when researchers try to localize important components inside language models. Different choices for scoring the patch effect and for corrupting the original input often point to different components or circuits. A reader should care because activation patching is the main tool for causal localization, so inconsistent outcomes mean that published claims about model mechanisms may depend on unstated methodological decisions. The authors run controlled comparisons across localization and circuit-discovery tasks, give reasons why some metrics and corruption styles are more reliable than others, and close with concrete recommendations for future experiments.

Core claim

In multiple localization and circuit-discovery settings, changing the evaluation metric or the corruption procedure inside activation patching produces noticeably different rankings of important model components. The authors supply empirical evidence for these differences together with conceptual arguments favoring particular metrics and corruption strategies, and they distill the observations into a set of recommended practices for activation patching.

What carries the argument

Activation patching (also called causal tracing or interchange intervention), together with its choices of evaluation metric and input-corruption method, used to measure the causal importance of model components.
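To make those knobs concrete, here is a minimal sketch of the two-run patching loop they sit inside. It assumes a generic PyTorch causal LM whose transformer blocks are reachable as model.blocks; the names (run_with_cache, patch_layer, model.blocks) are illustrative placeholders, not the paper's code or any specific library's API.

```python
# Hedged sketch of activation patching; all names are hypothetical placeholders.
def run_with_cache(model, tokens):
    """Run the model once on the clean prompt, recording each block's output."""
    cache, handles = {}, []

    def make_save_hook(layer_idx):
        def hook(module, inputs, output):
            cache[layer_idx] = output.detach()
        return hook

    for i, block in enumerate(model.blocks):
        handles.append(block.register_forward_hook(make_save_hook(i)))
    clean_logits = model(tokens)
    for h in handles:
        h.remove()
    return clean_logits, cache


def patch_layer(model, corrupt_tokens, clean_cache, layer):
    """Re-run on the corrupted prompt, overwriting one block's output with its
    activation from the clean run (the 'restore clean into corrupted' direction)."""
    def overwrite_hook(module, inputs, output):
        return clean_cache[layer]  # returning a value replaces the block's output

    handle = model.blocks[layer].register_forward_hook(overwrite_hook)
    patched_logits = model(corrupt_tokens)
    handle.remove()
    return patched_logits


# The hyperparameters the paper varies enter around this loop:
#   (1) how corrupt_tokens are produced (e.g. Gaussian-noised embeddings vs. a
#       minimally different counterfactual prompt), and
#   (2) which metric the caller computes from patched_logits
#       (e.g. probability of the correct answer vs. logit difference vs. KL).
```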

If this is right

  • Standardizing on a small set of metrics and corruption methods would reduce conflicting localization claims across papers.
  • Some commonly used metrics give systematically different importance scores than others, so results are not interchangeable.
  • Reporting results under multiple metric choices would make circuit discoveries more robust (a comparison sketch follows this list).
  • The recommendations can be adopted immediately to make new localization studies more reproducible.
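A hedged sketch of what "multiple metric choices" means in practice: the same clean, corrupted, and patched logits can be scored as a normalized logit-difference recovery, a probability recovery, or a KL divergence to the clean distribution, and these need not rank components the same way. Tensor shapes and names below are assumptions for illustration (logits of shape [batch, seq, vocab]), not the paper's code.

```python
import torch.nn.functional as F

def patching_metrics(clean_logits, corrupt_logits, patched_logits, answer_id, wrong_id):
    """Score one patched run three ways (illustrative; all names are placeholders)."""
    def logit_diff(logits):
        # correct-answer logit minus competing-answer logit at the final position
        return (logits[..., -1, answer_id] - logits[..., -1, wrong_id]).mean()

    def answer_prob(logits):
        return logits[..., -1, :].softmax(-1)[..., answer_id].mean()

    # normalized recovery: 0 = corrupted-run behaviour, 1 = clean-run behaviour
    ld_recovered = ((logit_diff(patched_logits) - logit_diff(corrupt_logits))
                    / (logit_diff(clean_logits) - logit_diff(corrupt_logits)))
    prob_recovered = ((answer_prob(patched_logits) - answer_prob(corrupt_logits))
                      / (answer_prob(clean_logits) - answer_prob(corrupt_logits)))

    # divergence of the patched next-token distribution from the clean one
    kl_to_clean = F.kl_div(
        patched_logits[..., -1, :].log_softmax(-1),
        clean_logits[..., -1, :].log_softmax(-1),
        reduction="batchmean", log_target=True)

    return {"logit_diff_recovered": ld_recovered.item(),
            "prob_recovered": prob_recovered.item(),
            "kl_to_clean": kl_to_clean.item()}
```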

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same hyperparameter sensitivity may appear in other intervention-based interpretability techniques.
  • Creating public benchmark suites that measure how much patching results change with metric choice would let the community test the recommendations on new models.
  • Authors may need to treat the choice of patching variant as an explicit experimental variable rather than a fixed detail.

Load-bearing premise

The localization and circuit-discovery tasks and models the authors tested are representative of how activation patching is used more broadly.

What would settle it

A replication on the same models and tasks that finds the same localization and circuit results no matter which metric or corruption method is chosen would falsify the reported sensitivity.

read the original abstract

Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters could lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations for the best practices of activation patching going forwards.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript systematically studies the sensitivity of activation patching (also called causal tracing or interchange intervention) to choices of evaluation metrics and corruption methods when performing localization and circuit discovery in language models. Through experiments in several concrete settings, it reports that different methodological choices can produce inconsistent interpretability conclusions, supplies conceptual arguments for preferring particular metrics or corruption schemes, and distills these observations into concrete recommendations for future use of the technique.

Significance. If the reported sensitivities prove robust, the work addresses a genuine methodological gap: activation patching is widely used yet implemented with little standardization, so explicit guidance on metrics and corruption could improve reproducibility across mechanistic interpretability studies. The empirical comparisons themselves constitute a useful contribution even if the final recommendations require further qualification.

major comments (2)
  1. [§4] §4 (Experimental results on IOI and related tasks): the central empirical claim—that varying metrics and corruption methods yields disparate localization and circuit outcomes—is demonstrated only for the specific models, tasks, and corruption schemes examined. Because the derived best-practice recommendations are presented without additional validation on other architectures, scales, or interpretability targets, the extrapolation from these settings to general usage rests on an untested assumption of representativeness.
  2. [§5] §5 (Recommendations): the preference for certain metrics is supported by a combination of the reported empirical differences and conceptual arguments, yet the manuscript does not quantify the magnitude or statistical reliability of those differences (e.g., via confidence intervals or multiple-run statistics). This makes it difficult to judge whether the observed disparities are large enough to justify changing community practice.

minor comments (2)
  1. [Abstract / §1] The abstract and introduction would benefit from an explicit enumeration of the exact models, tasks, and corruption functions used in the main experiments so that readers can immediately assess the scope of the reported findings.
  2. [§3] Notation for the different patching metrics (e.g., the precise definitions of “direct effect,” “indirect effect,” or normalized variants) should be collected in a single table or subsection for easy reference when comparing results across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work's significance and for the detailed, constructive major comments. We address each point below and have revised the manuscript to incorporate additional discussion and statistical quantification where feasible.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental results on IOI and related tasks): the central empirical claim—that varying metrics and corruption methods yields disparate localization and circuit outcomes—is demonstrated only for the specific models, tasks, and corruption schemes examined. Because the derived best-practice recommendations are presented without additional validation on other architectures, scales, or interpretability targets, the extrapolation from these settings to general usage rests on an untested assumption of representativeness.

    Authors: We agree that the experiments are confined to specific, widely studied settings such as the IOI task and smaller-scale models like GPT-2. These were selected because they represent standard benchmarks in the mechanistic interpretability literature where activation patching is commonly applied. The inconsistencies we document even in these canonical cases already underscore the importance of methodological choices. In the revised manuscript we have added an explicit limitations and scope section that discusses the representativeness of our settings, notes that similar sensitivities are likely in other contexts, and recommends that future studies validate the proposed practices on additional architectures and tasks. We do not claim the recommendations are universally proven but present them as evidence-based guidance for prevalent use cases. revision: partial

  2. Referee: [§5] §5 (Recommendations): the preference for certain metrics is supported by a combination of the reported empirical differences and conceptual arguments, yet the manuscript does not quantify the magnitude or statistical reliability of those differences (e.g., via confidence intervals or multiple-run statistics). This makes it difficult to judge whether the observed disparities are large enough to justify changing community practice.

    Authors: We concur that quantifying the scale and reliability of the observed differences would strengthen the case for the recommendations. The revised version now includes results from multiple random seeds for the primary experiments, with error bars and approximate confidence intervals reported for key localization and circuit metrics. These additions allow readers to evaluate whether the disparities are sufficiently consistent and large to warrant changes in practice. The conceptual arguments remain as before but are now paired with this statistical context. revision: yes
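A minimal sketch of the kind of multi-seed quantification this response describes, assuming a hypothetical run_patching_experiment(seed) that returns one importance score per component; the helper and its normal-approximation interval are illustrative, not the authors' actual analysis code.

```python
import numpy as np

def summarize_over_seeds(run_patching_experiment, seeds=(0, 1, 2, 3, 4)):
    """Repeat a patching experiment across seeds and report the mean score per
    component with an approximate 95% confidence interval."""
    runs = np.stack([run_patching_experiment(seed=s) for s in seeds])  # [n_seeds, n_components]
    mean = runs.mean(axis=0)
    sem = runs.std(axis=0, ddof=1) / np.sqrt(len(seeds))
    return mean, mean - 1.96 * sem, mean + 1.96 * sem
```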

Circularity Check

0 steps flagged

No circularity: empirical comparisons and recommendations are self-contained

full rationale

The paper performs direct empirical experiments comparing activation patching metrics, corruption methods, and hyperparameters across specific localization and circuit discovery tasks in language models. It reports observed disparities in results and offers recommendations backed by those observations plus conceptual arguments, without any derivation chain, equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claims to prior inputs by construction. The analysis stands on its own experimental data rather than self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical methodological study. It introduces no new free parameters, axioms, or invented entities; it relies on standard assumptions from the mechanistic interpretability literature about what activation patching measures.

pith-pipeline@v0.9.0 · 5432 in / 968 out tokens · 44081 ms · 2026-05-17T11:52:22.351648+00:00 · methodology

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.

  2. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  3. Do Audio-Visual Large Language Models Really See and Hear?

    cs.AI 2026-04 unverdicted novelty 8.0

    AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.

  4. How LLMs Are Persuaded: A Few Attention Heads, Rerouted

    cs.AI 2026-05 unverdicted novelty 7.0

    Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.

  5. Data-driven Circuit Discovery for Interpretability of Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.

  6. Cell-Based Representation of Relational Binding in Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...

  7. Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

    cs.CR 2026-04 unverdicted novelty 7.0

    HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

  8. CURE: Circuit-Aware Unlearning for LLM-based Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    CURE disentangles LLM recommendation circuits into forget-specific, retain-specific, and task-shared modules with tailored update rules to achieve more effective unlearning than weighted baselines.

  9. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  10. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  11. Knowledge Vector of Logical Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    Distinct linear knowledge vectors for deductive, inductive, and abductive reasoning in LLMs can be refined via complementary subspace constraints to improve performance through mutual knowledge sharing.

  12. How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

    cs.LG 2026-04 unverdicted novelty 6.0

    LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.

  13. Understanding the Mechanism of Altruism in Large Language Models

    econ.GN 2026-04 unverdicted novelty 6.0

    A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

  14. Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation

    cs.SE 2026-04 unverdicted novelty 6.0

    Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.

  15. Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

    cs.LG 2026-04 unverdicted novelty 6.0

    Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.

  16. From Attribution to Action: A Human-Centered Application of Activation Steering

    cs.AI 2026-04 unverdicted novelty 6.0

    Activation steering paired with attribution enables intervention-based debugging in vision models, as all 8 interviewed experts shifted to hypothesis testing, most trusted observed responses, and highlighted risks lik...

  17. How to use and interpret activation patching

    cs.LG 2024-04 accept novelty 5.0

    Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.

Reference graph

Works this paper leans on

108 extracted references · 108 canonical work pages · cited by 16 Pith papers · 3 internal anchors

  1. [1]

     Attention is all you need

     Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  2. [2]

     Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases

     Chris Olah. Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases. Transformer Circuits Thread, 2022

  3. [3]

     Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors

     Zeyu Yun, Yubei Chen, Bruno Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. In Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, 2021

  4. [5]

     Feature relevance quantification in explainable AI: A causal problem

     Dominik Janzing, Lenon Minorics, and Patrick Blöbaum. Feature relevance quantification in explainable AI: A causal problem. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020

  5. [6]

     A circuit for Python docstrings in a 4-layer attention-only transformer

     Stefan Heimersheim and Jett Janiak. A circuit for Python docstrings in a 4-layer attention-only transformer, 2023

  6. [7]

     TransformerLens

     Neel Nanda and Joseph Bloom. TransformerLens, 2022

  7. [8]

     GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model

     Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model, 2021

  8. [9]

     Investigating gender bias in language models using causal mediation analysis

     Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  9. [10]

     Discovering Variable Binding Circuitry with Desiderata

     Discovering Variable Binding Circuitry with Desiderata. arXiv preprint arXiv:2307.03637, 2023

  10. [11]

     Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space

     Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

  11. [12]

     How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model

     Michael Hanna, Ollie Liu, and Alexandre Variengien. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  12. [13]

     How do transformers learn topic structure: Towards a mechanistic understanding

     How do transformers learn topic structure: Towards a mechanistic understanding. In International Conference on Machine Learning (ICML), 2023

  13. [14]

     Kaiyue Wen, Yuchen Li, Bingbin Liu, and Andrej Risteski

  14. [15]

     Natural language descriptions of deep visual features

     Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In International Conference on Learning Representations (ICLR), 2022

  15. [16]

     Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

     Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  16. [17]

     Hidden progress in deep learning: SGD learns parities near the computational limit

     Boaz Barak, Benjamin Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Hidden progress in deep learning: SGD learns parities near the computational limit. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  17. [18]

     The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations

     Peter Hase, Harry Xie, and Mohit Bansal. The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  18. [23]

     Eliciting Latent Predictions from Transformers with the Tuned Lens

     Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023

  19. [24]

     Interpreting Transformer's Attention Dynamic Memory and Visualizing the Semantic Information Flow of GPT

     Shahar Katz and Yonatan Belinkov. Interpreting Transformer's Attention Dynamic Memory and Visualizing the Semantic Information Flow of GPT. arXiv preprint arXiv:2305.13417, 2023

  20. [25]

     A benchmark for interpretability methods in deep neural networks

     Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019

  21. [26]

     In-context Learning and Induction Heads

     In-context Learning and Induction Heads. Transformer Circuits Thread, 2022

  22. [28]

     Direct and indirect effects

     Judea Pearl. Direct and indirect effects. In Conference on Uncertainty in Artificial Intelligence (UAI), 2001

  23. [29]

     Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task

     Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. In International Conference on Learning Representations (ICLR), 2023

  24. [30]

     Progress measures for grokking via mechanistic interpretability

     Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In International Conference on Learning Representations (ICLR), 2023

  25. [31]

     A toy model of universality: Reverse engineering how networks learn group operations

     Bilal Chughtai, Lawrence Chan, and Neel Nanda. A toy model of universality: Reverse engineering how networks learn group operations. In International Conference on Machine Learning (ICML), 2023

  26. [32]

     The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks

     Ziqian Zhong, Ziming Liu, Max Tegmark, and Jacob Andreas. The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  27. [33]

     A Mathematical Framework for Transformer Circuits

     Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, and others. A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread, 2021

  28. [34]

     Thread: Circuits

     Nick Cammarata, Shan Carter, Gabriel Goh, Chris Olah, Michael Petrov, Ludwig Schubert, Chelsea Voss, Ben Egan, and Swee Kiat Lim. Thread: Circuits. Distill, 5(3):e24, 2020

  29. [35]

     Inducing causal structure for interpretable neural networks

     Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah Goodman, and Christopher Potts. Inducing causal structure for interpretable neural networks. In International Conference on Machine Learning (ICML), 2022

  30. [36]

     Discovering the Compositional Structure of Vector Representations with Role Learning Networks

     Paul Soulos, R. Thomas McCoy, Tal Linzen, and Paul Smolensky. Discovering the Compositional Structure of Vector Representations with Role Learning Networks. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020

  31. [38]

     Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation

     Atticus Geiger, Kyle Richardson, and Christopher Potts. Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020

  32. [39]

     Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models

     Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart M Shieber, Tal Linzen, and Yonatan Belinkov. Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021

  33. [40]

     Toward transparent AI: A survey on interpreting the inner structures of deep neural networks

     Stephen Casper, Tilman Rauker, Anson Ho, and Dylan Hadfield-Menell. Toward transparent AI: A survey on interpreting the inner structures of deep neural networks. In IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2022

  34. [41]

     Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

     Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. In International Conference on Learning Representations (ICLR), 2023

  35. [42]

     Language models are few-shot learners

     Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, and others. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  36. [43]

     Transformer Feed-Forward Layers Are Key-Value Memories

     Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer Feed-Forward Layers Are Key-Value Memories. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

  37. [44]

     Causal Abstractions of Neural Networks

     Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal Abstractions of Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  38. [45]

     Language models can explain neurons in language models

     Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models, 2023

  39. [46]

     Towards automated circuit discovery for mechanistic interpretability

     Arthur Conmy, Augustine N Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  40. [47]

     Causal scrubbing, a method for rigorously testing interpretability hypotheses

     Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldwosky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas. Causal scrubbing, a method for rigorously testing interpretability hypotheses. AI Alignment Forum, 2022

  41. [48]

     Locating and editing factual associations in GPT

     Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  42. [51]

     Language models are unsupervised multitask learners

     Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 2019

  43. [52]

     Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models

     Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  44. [53]

     Compositional explanations of neurons

     Jesse Mu and Jacob Andreas. Compositional explanations of neurons. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  45. [55]

     Interpretability at scale: Identifying causal mechanisms in alpaca

     Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah Goodman. Interpretability at scale: Identifying causal mechanisms in alpaca. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  46. [56]

     Analyzing Transformers in Embedding Space

     Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing Transformers in Embedding Space. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023

  47. [58]

     Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale

     Hritik Bansal, Karthik Gopalakrishnan, Saket Dingliwal, Sravan Bodapati, Katrin Kirchhoff, and Dan Roth. Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023

  48. [59]

     Michael A Lepori, Ellie Pavlick, and Thomas Serre

  49. [60]

     Knowledge Neurons in Pretrained Transformers

     Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge Neurons in Pretrained Transformers. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022

  50. [63]

    Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale

    Hritik Bansal, Karthik Gopalakrishnan, Saket Dingliwal, Sravan Bodapati, Katrin Kirchhoff, and Dan Roth. Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023

  51. [64]

    Hidden progress in deep learning: SGD learns parities near the computational limit

    Boaz Barak, Benjamin Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Hidden progress in deep learning: SGD learns parities near the computational limit. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  52. [65]

    Language models can explain neurons in language models

    Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html , 2023

  53. [66]

    On privileged and convergent bases in neural network representations

    Davis Brown, Nikhil Vyas, and Yamini Bansal. On privileged and convergent bases in neural network representations. arXiv preprint arXiv:2307.12941, 2023

  54. [67]

     Thread: Circuits

     Nick Cammarata, Shan Carter, Gabriel Goh, Chris Olah, Michael Petrov, Ludwig Schubert, Chelsea Voss, Ben Egan, and Swee Kiat Lim. Thread: Circuits. Distill, 5(3):e24, 2020

  55. [68]

     Toward transparent AI: A survey on interpreting the inner structures of deep neural networks

     Stephen Casper, Tilman Rauker, Anson Ho, and Dylan Hadfield-Menell. Toward transparent AI: A survey on interpreting the inner structures of deep neural networks. In IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2022

  56. [69]

    Causal scrubbing, a method for rigorously testing interpretability hypotheses

    Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldwosky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas. Causal scrubbing, a method for rigorously testing interpretability hypotheses. AI Alignment Forum, 2022. https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing

  57. [70]

    A toy model of universality: Reverse engineering how networks learn group operations

    Bilal Chughtai, Lawrence Chan, and Neel Nanda. A toy model of universality: Reverse engineering how networks learn group operations. In International Conference on Machine Learning (ICML), 2023

  58. [71]

    Towards automated circuit discovery for mechanistic interpretability

     Arthur Conmy, Augustine N Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  59. [72]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023

  60. [73]

    Knowledge neurons in pretrained transformers

    Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022

  61. [74]

    Analyzing transformers in embedding space

    Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing transformers in embedding space. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023

  62. [75]

    A mathematical framework for transformer circuits

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

  63. [76]

    Causal analysis of syntactic agreement mechanisms in neural language models

    Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart M Shieber, Tal Linzen, and Yonatan Belinkov. Causal analysis of syntactic agreement mechanisms in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJC...

  64. [77]

    Neural natural language inference models partially embed theories of lexical entailment and negation

    Atticus Geiger, Kyle Richardson, and Christopher Potts. Neural natural language inference models partially embed theories of lexical entailment and negation. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020

  65. [78]

    Causal abstractions of neural networks

    Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  66. [79]

    Inducing causal structure for interpretable neural networks

    Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah Goodman, and Christopher Potts. Inducing causal structure for interpretable neural networks. In International Conference on Machine Learning (ICML), 2022

  67. [80]

    Finding alignments between interpretable causal variables and distributed neural representations

    Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah D Goodman. Finding alignments between interpretable causal variables and distributed neural representations. arXiv preprint arXiv:2303.02536, 2023

  68. [81]

    Transformer feed-forward layers are key-value memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

  69. [82]

    Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space

    Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

  70. [83]

    Dissecting recall of factual associations in auto-regressive language models

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767, 2023

  71. [84]

    Localizing Model Behavior with Path Patching

    Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969, 2023

  72. [85]

    Finding neurons in a haystack: Case studies with sparse probing

    Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023

  73. [86]

     How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model

     Michael Hanna, Ollie Liu, and Alexandre Variengien. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  74. [87]

    The out-of-distribution problem in explainability and search methods for feature importance explanations

    Peter Hase, Harry Xie, and Mohit Bansal. The out-of-distribution problem in explainability and search methods for feature importance explanations. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  75. [88]

     Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models

     Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  76. [89]

     A circuit for Python docstrings in a 4-layer attention-only transformer

     Stefan Heimersheim and Jett Janiak. A circuit for Python docstrings in a 4-layer attention-only transformer. https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only , 2023

  77. [90]

    Natural language descriptions of deep visual features

    Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In International Conference on Learning Representations (ICLR), 2021

  78. [91]

    A benchmark for interpretability methods in deep neural networks

    Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A benchmark for interpretability methods in deep neural networks. In Advances in neural information processing systems (NeurIPS), 2019

  79. [92]

    Feature relevance quantification in explainable ai: A causal problem

    Dominik Janzing, Lenon Minorics, and Patrick Bl \"o baum. Feature relevance quantification in explainable ai: A causal problem. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020

  80. [93]

    Interpreting transformer's attention dynamic memory and visualizing the semantic information flow of GPT

     Shahar Katz and Yonatan Belinkov. Interpreting transformer's attention dynamic memory and visualizing the semantic information flow of GPT. arXiv preprint arXiv:2305.13417, 2023

Showing first 80 references.