pith. machine review for the scientific record.

arxiv: 2309.16042 · v2 · submitted 2023-09-27 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 11:52 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords activation patching · mechanistic interpretability · language models · localization · circuit discovery · evaluation metrics · corruption methods · best practices

The pith

Varying metrics and corruption methods in activation patching can produce conflicting pictures of which model components matter.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how small changes in the way activation patching is performed affect the results when researchers try to localize important components inside language models. Different choices for scoring the patch effect and for corrupting the original input often point to different components or circuits. A reader should care because activation patching is the main tool for causal localization, so inconsistent outcomes mean that published claims about model mechanisms may depend on unstated methodological decisions. The authors run controlled comparisons across localization and circuit-discovery tasks, give reasons why some metrics and corruption styles are more reliable than others, and close with concrete recommendations for future experiments.

Core claim

In multiple localization and circuit-discovery settings, changing the evaluation metric or the corruption procedure inside activation patching produces noticeably different rankings of important model components. The authors supply empirical evidence for these differences together with conceptual arguments favoring particular metrics and corruption strategies, and they distill the observations into a set of recommended practices for activation patching.

What carries the argument

Activation patching (also called causal tracing or interchange intervention), together with its choices of evaluation metric and input-corruption method, used to measure the causal importance of model components.
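To make those knobs concrete, here is a minimal sketch of the two-run patching loop they sit inside. It assumes a generic PyTorch causal LM whose transformer blocks are reachable as model.blocks; the names (run_with_cache, patch_layer, model.blocks) are illustrative placeholders, not the paper's code or any specific library's API.

```python
# Hedged sketch of activation patching; all names are hypothetical placeholders.
def run_with_cache(model, tokens):
    """Run the model once on the clean prompt, recording each block's output."""
    cache, handles = {}, []

    def make_save_hook(layer_idx):
        def hook(module, inputs, output):
            cache[layer_idx] = output.detach()
        return hook

    for i, block in enumerate(model.blocks):
        handles.append(block.register_forward_hook(make_save_hook(i)))
    clean_logits = model(tokens)
    for h in handles:
        h.remove()
    return clean_logits, cache


def patch_layer(model, corrupt_tokens, clean_cache, layer):
    """Re-run on the corrupted prompt, overwriting one block's output with its
    activation from the clean run (the 'restore clean into corrupted' direction)."""
    def overwrite_hook(module, inputs, output):
        return clean_cache[layer]  # returning a value replaces the block's output

    handle = model.blocks[layer].register_forward_hook(overwrite_hook)
    patched_logits = model(corrupt_tokens)
    handle.remove()
    return patched_logits


# The hyperparameters the paper varies enter around this loop:
#   (1) how corrupt_tokens are produced (e.g. Gaussian-noised embeddings vs. a
#       minimally different counterfactual prompt), and
#   (2) which metric the caller computes from patched_logits
#       (e.g. probability of the correct answer vs. logit difference vs. KL).
```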

If this is right

  • Standardizing on a small set of metrics and corruption methods would reduce conflicting localization claims across papers.
  • Some commonly used metrics give systematically different importance scores than others, so results are not interchangeable.
  • Reporting results under multiple metric choices would make circuit discoveries more robust (a comparison sketch follows this list).
  • The recommendations can be adopted immediately to make new localization studies more reproducible.
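A hedged sketch of what "multiple metric choices" means in practice: the same clean, corrupted, and patched logits can be scored as a normalized logit-difference recovery, a probability recovery, or a KL divergence to the clean distribution, and these need not rank components the same way. Tensor shapes and names below are assumptions for illustration (logits of shape [batch, seq, vocab]), not the paper's code.

```python
import torch.nn.functional as F

def patching_metrics(clean_logits, corrupt_logits, patched_logits, answer_id, wrong_id):
    """Score one patched run three ways (illustrative; all names are placeholders)."""
    def logit_diff(logits):
        # correct-answer logit minus competing-answer logit at the final position
        return (logits[..., -1, answer_id] - logits[..., -1, wrong_id]).mean()

    def answer_prob(logits):
        return logits[..., -1, :].softmax(-1)[..., answer_id].mean()

    # normalized recovery: 0 = corrupted-run behaviour, 1 = clean-run behaviour
    ld_recovered = ((logit_diff(patched_logits) - logit_diff(corrupt_logits))
                    / (logit_diff(clean_logits) - logit_diff(corrupt_logits)))
    prob_recovered = ((answer_prob(patched_logits) - answer_prob(corrupt_logits))
                      / (answer_prob(clean_logits) - answer_prob(corrupt_logits)))

    # divergence of the patched next-token distribution from the clean one
    kl_to_clean = F.kl_div(
        patched_logits[..., -1, :].log_softmax(-1),
        clean_logits[..., -1, :].log_softmax(-1),
        reduction="batchmean", log_target=True)

    return {"logit_diff_recovered": ld_recovered.item(),
            "prob_recovered": prob_recovered.item(),
            "kl_to_clean": kl_to_clean.item()}
```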

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same hyperparameter sensitivity may appear in other intervention-based interpretability techniques.
  • Creating public benchmark suites that measure how much patching results change with metric choice would let the community test the recommendations on new models.
  • Authors may need to treat the choice of patching variant as an explicit experimental variable rather than a fixed detail.

Load-bearing premise

The localization and circuit-discovery tasks and models the authors tested are representative of how activation patching is used more broadly.

What would settle it

A replication on the same models and tasks that finds the same localization and circuit results no matter which metric or corruption method is chosen would falsify the reported sensitivity.

read the original abstract

Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters could lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations for the best practices of activation patching going forwards.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript systematically studies the sensitivity of activation patching (also called causal tracing or interchange intervention) to choices of evaluation metrics and corruption methods when performing localization and circuit discovery in language models. Through experiments in several concrete settings, it reports that different methodological choices can produce inconsistent interpretability conclusions, supplies conceptual arguments for preferring particular metrics or corruption schemes, and distills these observations into concrete recommendations for future use of the technique.

Significance. If the reported sensitivities prove robust, the work addresses a genuine methodological gap: activation patching is widely used yet implemented with little standardization, so explicit guidance on metrics and corruption could improve reproducibility across mechanistic interpretability studies. The empirical comparisons themselves constitute a useful contribution even if the final recommendations require further qualification.

major comments (2)
  1. [§4] §4 (Experimental results on IOI and related tasks): the central empirical claim—that varying metrics and corruption methods yields disparate localization and circuit outcomes—is demonstrated only for the specific models, tasks, and corruption schemes examined. Because the derived best-practice recommendations are presented without additional validation on other architectures, scales, or interpretability targets, the extrapolation from these settings to general usage rests on an untested assumption of representativeness.
  2. [§5] §5 (Recommendations): the preference for certain metrics is supported by a combination of the reported empirical differences and conceptual arguments, yet the manuscript does not quantify the magnitude or statistical reliability of those differences (e.g., via confidence intervals or multiple-run statistics). This makes it difficult to judge whether the observed disparities are large enough to justify changing community practice.

minor comments (2)
  1. [Abstract / §1] The abstract and introduction would benefit from an explicit enumeration of the exact models, tasks, and corruption functions used in the main experiments so that readers can immediately assess the scope of the reported findings.
  2. [§3] Notation for the different patching metrics (e.g., the precise definitions of “direct effect,” “indirect effect,” or normalized variants) should be collected in a single table or subsection for easy reference when comparing results across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work's significance and for the detailed, constructive major comments. We address each point below and have revised the manuscript to incorporate additional discussion and statistical quantification where feasible.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental results on IOI and related tasks): the central empirical claim—that varying metrics and corruption methods yields disparate localization and circuit outcomes—is demonstrated only for the specific models, tasks, and corruption schemes examined. Because the derived best-practice recommendations are presented without additional validation on other architectures, scales, or interpretability targets, the extrapolation from these settings to general usage rests on an untested assumption of representativeness.

    Authors: We agree that the experiments are confined to specific, widely studied settings such as the IOI task and smaller-scale models like GPT-2. These were selected because they represent standard benchmarks in the mechanistic interpretability literature where activation patching is commonly applied. The inconsistencies we document even in these canonical cases already underscore the importance of methodological choices. In the revised manuscript we have added an explicit limitations and scope section that discusses the representativeness of our settings, notes that similar sensitivities are likely in other contexts, and recommends that future studies validate the proposed practices on additional architectures and tasks. We do not claim the recommendations are universally proven but present them as evidence-based guidance for prevalent use cases. revision: partial

  2. Referee: [§5] §5 (Recommendations): the preference for certain metrics is supported by a combination of the reported empirical differences and conceptual arguments, yet the manuscript does not quantify the magnitude or statistical reliability of those differences (e.g., via confidence intervals or multiple-run statistics). This makes it difficult to judge whether the observed disparities are large enough to justify changing community practice.

    Authors: We concur that quantifying the scale and reliability of the observed differences would strengthen the case for the recommendations. The revised version now includes results from multiple random seeds for the primary experiments, with error bars and approximate confidence intervals reported for key localization and circuit metrics. These additions allow readers to evaluate whether the disparities are sufficiently consistent and large to warrant changes in practice. The conceptual arguments remain as before but are now paired with this statistical context. revision: yes
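A minimal sketch of the kind of multi-seed quantification this response describes, assuming a hypothetical run_patching_experiment(seed) that returns one importance score per component; the helper and its normal-approximation interval are illustrative, not the authors' actual analysis code.

```python
import numpy as np

def summarize_over_seeds(run_patching_experiment, seeds=(0, 1, 2, 3, 4)):
    """Repeat a patching experiment across seeds and report the mean score per
    component with an approximate 95% confidence interval."""
    runs = np.stack([run_patching_experiment(seed=s) for s in seeds])  # [n_seeds, n_components]
    mean = runs.mean(axis=0)
    sem = runs.std(axis=0, ddof=1) / np.sqrt(len(seeds))
    return mean, mean - 1.96 * sem, mean + 1.96 * sem
```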

Circularity Check

0 steps flagged

No circularity: empirical comparisons and recommendations are self-contained

full rationale

The paper performs direct empirical experiments comparing activation patching metrics, corruption methods, and hyperparameters across specific localization and circuit discovery tasks in language models. It reports observed disparities in results and offers recommendations backed by those observations plus conceptual arguments, without any derivation chain, equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claims to prior inputs by construction. The analysis stands on its own experimental data rather than self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical methodological study. It introduces no new free parameters, axioms, or invented entities; it relies on standard assumptions from the mechanistic interpretability literature about what activation patching measures.

pith-pipeline@v0.9.0 · 5432 in / 968 out tokens · 44081 ms · 2026-05-17T11:52:22.351648+00:00 · methodology

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.

  2. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  3. Do Audio-Visual Large Language Models Really See and Hear?

    cs.AI 2026-04 unverdicted novelty 8.0

    AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.

  4. How LLMs Are Persuaded: A Few Attention Heads, Rerouted

    cs.AI 2026-05 unverdicted novelty 7.0

    Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.

  5. Data-driven Circuit Discovery for Interpretability of Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.

  6. Cell-Based Representation of Relational Binding in Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...

  7. Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

    cs.CR 2026-04 unverdicted novelty 7.0

    HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

  8. CURE: Circuit-Aware Unlearning for LLM-based Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    CURE disentangles LLM recommendation circuits into forget-specific, retain-specific, and task-shared modules with tailored update rules to achieve more effective unlearning than weighted baselines.

  9. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  10. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  11. Knowledge Vector of Logical Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    Distinct linear knowledge vectors for deductive, inductive, and abductive reasoning in LLMs can be refined via complementary subspace constraints to improve performance through mutual knowledge sharing.

  12. How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

    cs.LG 2026-04 unverdicted novelty 6.0

    LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.

  13. Understanding the Mechanism of Altruism in Large Language Models

    econ.GN 2026-04 unverdicted novelty 6.0

    A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

  14. Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation

    cs.SE 2026-04 unverdicted novelty 6.0

    Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.

  15. Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

    cs.LG 2026-04 unverdicted novelty 6.0

    Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.

  16. From Attribution to Action: A Human-Centered Application of Activation Steering

    cs.AI 2026-04 unverdicted novelty 6.0

    Activation steering paired with attribution enables intervention-based debugging in vision models, as all 8 interviewed experts shifted to hypothesis testing, most trusted observed responses, and highlighted risks lik...

  17. How to use and interpret activation patching

    cs.LG 2024-04 accept novelty 5.0

    Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.

Reference graph

Works this paper leans on

108 extracted references · 108 canonical work pages · cited by 16 Pith papers · 3 internal anchors

  1. [1]

     Attention is all you need

     Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  2. [2]

     Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases

     Chris Olah. Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases. Transformer Circuits Thread, 2022

  3. [3]

     Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors

     Zeyu Yun, Yubei Chen, Bruno Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. In Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, 2021

  4. [5]

     Feature relevance quantification in explainable AI: A causal problem

     Dominik Janzing, Lenon Minorics, and Patrick Blöbaum. Feature relevance quantification in explainable AI: A causal problem. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020

  5. [6]

     A circuit for Python docstrings in a 4-layer attention-only transformer

     Stefan Heimersheim and Jett Janiak. A circuit for Python docstrings in a 4-layer attention-only transformer, 2023

  6. [7]

     TransformerLens

     Neel Nanda and Joseph Bloom. TransformerLens, 2022

  7. [8]

     GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model

     Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model, 2021

  8. [9]

     Investigating gender bias in language models using causal mediation analysis

     Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  9. [10]

     Discovering Variable Binding Circuitry with Desiderata

     Discovering Variable Binding Circuitry with Desiderata. arXiv preprint arXiv:2307.03637, 2023

  10. [11]

     Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space

     Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

  11. [12]

     How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model

     Michael Hanna, Ollie Liu, and Alexandre Variengien. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  12. [13]

     How do transformers learn topic structure: Towards a mechanistic understanding

     How do transformers learn topic structure: Towards a mechanistic understanding. In International Conference on Machine Learning (ICML), 2023

  13. [14]

     Kaiyue Wen, Yuchen Li, Bingbin Liu, and Andrej Risteski

  14. [15]

     Natural language descriptions of deep visual features

     Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In International Conference on Learning Representations (ICLR), 2022

  15. [16]

     Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

     Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  16. [17]

     Hidden progress in deep learning: SGD learns parities near the computational limit

     Boaz Barak, Benjamin Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Hidden progress in deep learning: SGD learns parities near the computational limit. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  17. [18]

     The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations

     Peter Hase, Harry Xie, and Mohit Bansal. The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  18. [23]

     Eliciting Latent Predictions from Transformers with the Tuned Lens

     Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023

  19. [24]

     Interpreting Transformer's Attention Dynamic Memory and Visualizing the Semantic Information Flow of GPT

     Shahar Katz and Yonatan Belinkov. Interpreting Transformer's Attention Dynamic Memory and Visualizing the Semantic Information Flow of GPT. arXiv preprint arXiv:2305.13417, 2023

  20. [25]

     A benchmark for interpretability methods in deep neural networks

     Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019

  21. [26]

     In-context Learning and Induction Heads

     In-context Learning and Induction Heads. Transformer Circuits Thread, 2022

  22. [28]

     Direct and indirect effects

     Judea Pearl. Direct and indirect effects. In Conference on Uncertainty in Artificial Intelligence (UAI), 2001

  23. [29]

     Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task

     Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. In International Conference on Learning Representations (ICLR), 2023

  24. [30]

     Progress measures for grokking via mechanistic interpretability

     Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In International Conference on Learning Representations (ICLR), 2023

  25. [31]

     A toy model of universality: Reverse engineering how networks learn group operations

     Bilal Chughtai, Lawrence Chan, and Neel Nanda. A toy model of universality: Reverse engineering how networks learn group operations. In International Conference on Machine Learning (ICML), 2023

  26. [32]

     The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks

     Ziqian Zhong, Ziming Liu, Max Tegmark, and Jacob Andreas. The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  27. [33]

     A Mathematical Framework for Transformer Circuits

     Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, and others. A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread, 2021

  28. [34]

     Thread: Circuits

     Nick Cammarata, Shan Carter, Gabriel Goh, Chris Olah, Michael Petrov, Ludwig Schubert, Chelsea Voss, Ben Egan, and Swee Kiat Lim. Thread: Circuits. Distill, 5(3):e24, 2020

  29. [35]

     Inducing causal structure for interpretable neural networks

     Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah Goodman, and Christopher Potts. Inducing causal structure for interpretable neural networks. In International Conference on Machine Learning (ICML), 2022

  30. [36]

     Discovering the Compositional Structure of Vector Representations with Role Learning Networks

     Paul Soulos, R. Thomas McCoy, Tal Linzen, and Paul Smolensky. Discovering the Compositional Structure of Vector Representations with Role Learning Networks. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020

  31. [38]

     Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation

     Atticus Geiger, Kyle Richardson, and Christopher Potts. Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020

  32. [39]

     Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models

     Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart M Shieber, Tal Linzen, and Yonatan Belinkov. Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021

  33. [40]

     Toward transparent AI: A survey on interpreting the inner structures of deep neural networks

     Stephen Casper, Tilman Rauker, Anson Ho, and Dylan Hadfield-Menell. Toward transparent AI: A survey on interpreting the inner structures of deep neural networks. In IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2022

  34. [41]

     Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

     Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. In International Conference on Learning Representations (ICLR), 2023

  35. [42]

     Language models are few-shot learners

     Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, and others. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  36. [43]

     Transformer Feed-Forward Layers Are Key-Value Memories

     Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer Feed-Forward Layers Are Key-Value Memories. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

  37. [44]

     Causal Abstractions of Neural Networks

     Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal Abstractions of Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  38. [45]

     Language models can explain neurons in language models

     Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models, 2023

  39. [46]

     Towards automated circuit discovery for mechanistic interpretability

     Arthur Conmy, Augustine N Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  40. [47]

     Causal scrubbing, a method for rigorously testing interpretability hypotheses

     Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldwosky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas. Causal scrubbing, a method for rigorously testing interpretability hypotheses. AI Alignment Forum, 2022

  41. [48]

     Locating and editing factual associations in GPT

     Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  42. [51]

     Language models are unsupervised multitask learners

     Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 2019

  43. [52]

     Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models

     Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  44. [53]

     Compositional explanations of neurons

     Jesse Mu and Jacob Andreas. Compositional explanations of neurons. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  45. [55]

     Interpretability at scale: Identifying causal mechanisms in alpaca

     Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah Goodman. Interpretability at scale: Identifying causal mechanisms in alpaca. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  46. [56]

     Analyzing Transformers in Embedding Space

     Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing Transformers in Embedding Space. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023

  47. [58]

     Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale

     Hritik Bansal, Karthik Gopalakrishnan, Saket Dingliwal, Sravan Bodapati, Katrin Kirchhoff, and Dan Roth. Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023

  48. [59]

     Michael A Lepori, Ellie Pavlick, and Thomas Serre

  49. [60]

     Knowledge Neurons in Pretrained Transformers

     Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge Neurons in Pretrained Transformers. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022

  50. [63]

    Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale

    Hritik Bansal, Karthik Gopalakrishnan, Saket Dingliwal, Sravan Bodapati, Katrin Kirchhoff, and Dan Roth. Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023

  51. [64]

    Hidden progress in deep learning: SGD learns parities near the computational limit

    Boaz Barak, Benjamin Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Hidden progress in deep learning: SGD learns parities near the computational limit. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  52. [65]

    Language models can explain neurons in language models

    Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html , 2023

  53. [66]

    On privileged and convergent bases in neural network representations

    Davis Brown, Nikhil Vyas, and Yamini Bansal. On privileged and convergent bases in neural network representations. arXiv preprint arXiv:2307.12941, 2023

  54. [67]

     Thread: Circuits

     Nick Cammarata, Shan Carter, Gabriel Goh, Chris Olah, Michael Petrov, Ludwig Schubert, Chelsea Voss, Ben Egan, and Swee Kiat Lim. Thread: Circuits. Distill, 5(3):e24, 2020

  55. [68]

     Toward transparent AI: A survey on interpreting the inner structures of deep neural networks

     Stephen Casper, Tilman Rauker, Anson Ho, and Dylan Hadfield-Menell. Toward transparent AI: A survey on interpreting the inner structures of deep neural networks. In IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2022

  56. [69]

    Causal scrubbing, a method for rigorously testing interpretability hypotheses

    Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldwosky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas. Causal scrubbing, a method for rigorously testing interpretability hypotheses. AI Alignment Forum, 2022. https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing

  57. [70]

    A toy model of universality: Reverse engineering how networks learn group operations

    Bilal Chughtai, Lawrence Chan, and Neel Nanda. A toy model of universality: Reverse engineering how networks learn group operations. In International Conference on Machine Learning (ICML), 2023

  58. [71]

    Towards automated circuit discovery for mechanistic interpretability

     Arthur Conmy, Augustine N Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  59. [72]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023

  60. [73]

    Knowledge neurons in pretrained transformers

    Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022

  61. [74]

    Analyzing transformers in embedding space

    Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing transformers in embedding space. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023

  62. [75]

    A mathematical framework for transformer circuits

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

  63. [76]

    Causal analysis of syntactic agreement mechanisms in neural language models

    Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart M Shieber, Tal Linzen, and Yonatan Belinkov. Causal analysis of syntactic agreement mechanisms in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJC...

  64. [77]

    Neural natural language inference models partially embed theories of lexical entailment and negation

    Atticus Geiger, Kyle Richardson, and Christopher Potts. Neural natural language inference models partially embed theories of lexical entailment and negation. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020

  65. [78]

    Causal abstractions of neural networks

    Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  66. [79]

    Inducing causal structure for interpretable neural networks

    Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah Goodman, and Christopher Potts. Inducing causal structure for interpretable neural networks. In International Conference on Machine Learning (ICML), 2022

  67. [80]

    Finding alignments between interpretable causal variables and distributed neural representations

    Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah D Goodman. Finding alignments between interpretable causal variables and distributed neural representations. arXiv preprint arXiv:2303.02536, 2023

  68. [81]

    Transformer feed-forward layers are key-value memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

  69. [82]

    Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space

    Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

  70. [83]

    Dissecting recall of factual associations in auto-regressive language models

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767, 2023

  71. [84]

    Localizing Model Behavior with Path Patching

    Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969, 2023

  72. [85]

    Finding neurons in a haystack: Case studies with sparse probing

    Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023

  73. [86]

     How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model

     Michael Hanna, Ollie Liu, and Alexandre Variengien. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  74. [87]

    The out-of-distribution problem in explainability and search methods for feature importance explanations

    Peter Hase, Harry Xie, and Mohit Bansal. The out-of-distribution problem in explainability and search methods for feature importance explanations. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  75. [88]

     Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models

     Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  76. [89]

     A circuit for Python docstrings in a 4-layer attention-only transformer

     Stefan Heimersheim and Jett Janiak. A circuit for Python docstrings in a 4-layer attention-only transformer. https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only , 2023

  77. [90]

    Natural language descriptions of deep visual features

    Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In International Conference on Learning Representations (ICLR), 2021

  78. [91]

    A benchmark for interpretability methods in deep neural networks

    Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A benchmark for interpretability methods in deep neural networks. In Advances in neural information processing systems (NeurIPS), 2019

  79. [92]

    Feature relevance quantification in explainable ai: A causal problem

    Dominik Janzing, Lenon Minorics, and Patrick Bl \"o baum. Feature relevance quantification in explainable ai: A causal problem. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020

  80. [93]

    Interpreting transformer's attention dynamic memory and visualizing the semantic information flow of GPT

     Shahar Katz and Yonatan Belinkov. Interpreting transformer's attention dynamic memory and visualizing the semantic information flow of GPT. arXiv preprint arXiv:2305.13417, 2023

Showing first 80 references.