Recognition: no theorem link
Localizing Model Behavior with Path Patching
Pith reviewed 2026-05-16 19:34 UTC · model grok-4.3
The pith
Path patching lets researchers test whether a neural network's behavior is localized to a specific set of paths through its components.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Path patching replaces activations along selected paths while leaving other activations unchanged, thereby isolating the causal effect of those paths on the model's output. This provides a quantitative test for the claim that a given behavior is localized to the chosen paths rather than distributed across the network. The method is used to sharpen the description of induction heads and to examine a concrete behavior in GPT-2, showing that the localization hypotheses can be stated and measured with greater precision than before.
What carries the argument
Path patching, an intervention that swaps activations along a hypothesized set of paths to measure their isolated causal contribution to behavior.
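To make the intervention concrete, here is a minimal sketch of path patching on a toy residual network, not the paper's implementation: the receiver component is fed the sender's output computed from an alternative input, while every other interaction, including the sender's direct route to the output, stays on the clean run. All names here (ToyResidualModel, comp_a, comp_b, comp_c) are illustrative assumptions.

```python
# Minimal path-patching sketch on a toy residual network (illustrative, not the paper's code).
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyResidualModel(nn.Module):
    """Three components writing additively into a residual stream."""
    def __init__(self, d=8):
        super().__init__()
        self.comp_a = nn.Linear(d, d)   # hypothesized "sender"
        self.comp_b = nn.Linear(d, d)   # off-path component
        self.comp_c = nn.Linear(d, d)   # hypothesized "receiver"
        self.unembed = nn.Linear(d, 2)

    def forward(self, x, a_out_for_c=None):
        a_out = self.comp_a(x)
        b_out = self.comp_b(x)
        resid = x + a_out + b_out
        # The input comp_c receives along the A->C path can be overridden;
        # every other term (x, b_out) still comes from the current run, and
        # comp_a's direct contribution to the output (via `resid`) stays clean.
        c_in = resid if a_out_for_c is None else x + a_out_for_c + b_out
        c_out = self.comp_c(c_in)
        return self.unembed(resid + c_out), a_out

model = ToyResidualModel()
x_clean = torch.randn(1, 8)  # reference ("clean") input
x_alt = torch.randn(1, 8)    # alternative input supplying the patch

with torch.no_grad():
    logits_clean, _ = model(x_clean)
    _, a_out_alt = model(x_alt)
    # Path patch A -> C: comp_c sees comp_a's output from x_alt, while all
    # other routes through the network are computed from x_clean.
    logits_patched, _ = model(x_clean, a_out_for_c=a_out_alt)

print("clean logits:  ", logits_clean)
print("patched logits:", logits_patched)
```

The gap between logits_clean and logits_patched is the effect attributable to the hypothesized sender-to-receiver path; a hypothesis that the behavior lives elsewhere predicts essentially no change.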
If this is right
- Researchers can state localization hypotheses in terms of explicit paths and obtain numerical evidence for or against them.
- Existing qualitative accounts of induction heads can be refined by measuring the exact contribution of the relevant paths.
- The same procedure can be applied to other behaviors in GPT-2 or similar models to produce comparable localization results.
- An open implementation lowers the cost of running additional path-patching experiments on new hypotheses.
Where Pith is reading between the lines
- The technique could be extended to compare competing localization hypotheses by pitting their path sets against each other in the same experiment.
- If path patching proves reliable, it might serve as a building block for automated search over possible localizations rather than manual hypothesis construction.
- Similar path-based interventions could be tried on models outside the transformer family to check whether the localization pattern holds more generally.
Load-bearing premise
Changing activations only along the chosen paths does not create unintended side effects or interactions that would alter the model's behavior through other routes.
What would settle it
Run path patching on a set of paths hypothesized to produce a specific output and observe whether the output changes exactly as predicted while all other model activations remain untouched.
Original abstract
Localizing behaviors of neural networks to a subset of the network's components or a subset of interactions between components is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad-hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses expressing that behaviors are localized to a set of paths. We refine an explanation of induction heads, characterize a behavior of GPT-2, and open source a framework for efficiently running similar experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces path patching, a technique for expressing and quantitatively testing hypotheses that neural network behaviors are localized to specific sets of paths through components. It applies the method to refine prior explanations of induction heads, characterizes a behavior in GPT-2, and releases an open-source framework for similar experiments.
Significance. If the isolation property holds, path patching would provide a useful quantitative tool for mechanistic interpretability, moving beyond ad-hoc qualitative localization claims. The open-sourced framework is a clear strength that supports reproducibility and extension by others.
major comments (1)
- [§3] §3 (Path Patching): The central claim that the intervention isolates causal contributions along hypothesized paths requires that residual-stream interactions with non-path components remain unchanged. No explicit measurement or ablation of cross-path leakage or downstream interference is reported, which is load-bearing for the quantitative evaluation of the induction-head and GPT-2 results.
minor comments (2)
- [Abstract] Abstract: The claim that the method 'refines' the induction-head explanation is stated without specifying the concrete change relative to prior work (e.g., what new quantitative evidence is added).
- [Experiments] Experiments: Figure captions and tables would benefit from explicit reporting of the exact quantitative metric (e.g., logit difference or accuracy delta) used to assess localization success.
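As a point of reference, one common form of the metric this comment asks to see reported is a logit difference between the expected completion and a contrasting one, computed on clean and patched runs; the snippet below is a generic, hedged illustration with placeholder names, not the paper's definition.

```python
import torch

def logit_diff(logits: torch.Tensor, correct_id: int, contrast_id: int) -> torch.Tensor:
    """Final-position logits of shape (batch, vocab); larger values mean a
    stronger preference for the expected completion over the contrast."""
    return (logits[:, correct_id] - logits[:, contrast_id]).mean()

# Hypothetical usage: score a localization claim on an explicitly reported quantity.
# effect = logit_diff(clean_logits, yes_id, no_id) - logit_diff(patched_logits, yes_id, no_id)
```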
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. The concern about explicitly verifying the isolation property is substantive, and we have revised the manuscript to include new measurements addressing cross-path leakage and downstream interference.
Point-by-point responses
Referee: [§3] §3 (Path Patching): The central claim that the intervention isolates causal contributions along hypothesized paths requires that residual-stream interactions with non-path components remain unchanged. No explicit measurement or ablation of cross-path leakage or downstream interference is reported, which is load-bearing for the quantitative evaluation of the induction-head and GPT-2 results.
Authors: We agree that an explicit check on residual-stream interactions strengthens the quantitative claims. Path patching replaces activations only along the hypothesized path while running the remainder of the forward pass on the clean input; by construction this keeps non-path component inputs identical to the clean run except for the direct contributions arriving via the patched path. Nevertheless, to address the referee's point we have added Section 3.4, which reports an ablation measuring L2-norm changes to activations of all non-path components before versus after patching. For the induction-head experiments the median change is below 4% and does not alter the reported effect sizes; analogous results hold for the GPT-2 behavior. We also include a short discussion of why downstream interference is already captured by the path-patching metric itself. These additions are now load-bearing for the revised quantitative claims.
revision: yes
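A hedged sketch of the kind of check described above: cache each component's activation on the clean and patched runs, then report the relative L2 change for components off the tested path. The hook-based caching is a generic PyTorch pattern; the names model, x_clean, and a_out_alt are reused from the toy sketch earlier on this page, and nothing here reproduces the paper's actual Section 3.4 measurements.

```python
import torch

def cache_activations(model, x, **forward_kwargs):
    """Run the model once and record every named submodule's output."""
    cache, handles = {}, []
    for name, module in model.named_modules():
        if name == "":  # skip the top-level module itself
            continue
        def hook(mod, inp, out, name=name):
            cache[name] = out.detach()
        handles.append(module.register_forward_hook(hook))
    with torch.no_grad():
        model(x, **forward_kwargs)
    for h in handles:
        h.remove()
    return cache

# Reusing `model`, `x_clean`, and `a_out_alt` from the toy path-patching sketch above.
clean_cache = cache_activations(model, x_clean)
patched_cache = cache_activations(model, x_clean, a_out_for_c=a_out_alt)

for name in ["comp_a", "comp_b"]:  # components off the tested A->C path
    delta = (patched_cache[name] - clean_cache[name]).norm()
    rel = (delta / clean_cache[name].norm()).item()
    print(f"{name}: relative L2 change = {rel:.2%}")  # ~0% by construction
```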
Circularity Check
Path patching introduced as independent experimental technique with no derivation chain
full rationale
The paper presents path patching as a new methodological tool for expressing and testing localization hypotheses in neural networks. No equations, parameters, or results are derived from prior fitted values or self-referential definitions. The abstract and description frame it as an experimental technique applied to induction heads and GPT-2 behaviors, without any load-bearing self-citations, ansatz smuggling, or renaming of known results as derivations. The central claim rests on the validity of the intervention method itself rather than reducing to its own inputs by construction. This is a standard case of a methods paper: its content is self-contained and evaluated against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Neural network behaviors can be localized to subsets of paths through components.
invented entities (1)
- path patching (no independent evidence)
Forward citations
Cited by 21 Pith papers
- WriteSAE: Sparse Autoencoders for Recurrent State. WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
- WriteSAE: Sparse Autoencoders for Recurrent State. WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.
- Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs. LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.
- In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification. In-context learning binds model outputs to the demonstrated label tokens as an exhaustive vocabulary, overriding semantic plausibility and causing fixation even with homogeneous or nonsense labels.
- Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training. Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.
- How Language Models Process Negation. LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.
- The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts. The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and sal...
- Dual-Pathway Circuits of Object Hallucination in Vision-Language Models. Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.
- Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces. A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
- When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel. CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
- Instructions Shape Production of Language, not Processing. Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
- Patch-Effect Graph Kernels for LLM Interpretability. Patch-effect graphs built from causal mediation, partial correlation, and co-influence, when analyzed with graph kernels, preserve task-discriminative signals from activation patching that outperform global shape desc...
- Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks. Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.
- Weight Patching: Toward Source-Level Mechanistic Localization in LLMs. Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a ...
- The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts. Features in deep networks correspond to linear directions of centroids summarizing local functional behavior, enabling sparser and more effective feature dictionaries via sparse autoencoders applied to centroids rathe...
- Automated Attention Pattern Discovery at Scale in Large Language Models. AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.
- Instructions Shape Production of Language, not Processing. Instructions primarily shape the production stage of language models rather than the processing stage, with task-specific information and causal effects stronger in output tokens than input tokens.
- Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes. Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
- Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models. The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
- How to use and interpret activation patching. Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.
- High-Dimensional Statistics: Reflections on Progress and Open Problems. A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.