arxiv: 2211.00593 · v1 · submitted 2022-11-01 · cs.LG · cs.AI · cs.CL

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt

Pith reviewed 2026-05-13 17:08 UTC · model grok-4.3

keywords mechanistic interpretability · attention circuits · GPT-2 · indirect object identification · causal interventions · transformer models · language model behavior · reverse engineering

The pith

GPT-2 small solves indirect object identification using a circuit of 26 attention heads in seven classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper maps the full internal mechanism by which GPT-2 small identifies the indirect object in a sentence, such as determining who received an item in a subject-verb-object construction. Researchers isolate 26 specific attention heads grouped into seven functional classes that together carry the necessary information flow. They locate these heads by applying targeted causal interventions that ablate or patch model components and measure the resulting change in task performance. The explanation is checked against three quantitative tests that assess how faithfully the circuit reproduces the original behavior, how completely it accounts for performance, and how minimal the selected set of heads is.
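
The intervention loop described above can be sketched on a toy model. The snippet below is an illustrative assumption, not the paper's code: it stands in for GPT-2 small with four hand-built linear "heads" whose outputs add into the logits, and the names `head_outputs`, `logit_diff`, and `patch_heads` are hypothetical. It runs a clean and a corrupted input, copies selected clean head outputs into the corrupted run, and reports how much of the indirect-object-minus-subject logit difference each head restores.

```python
import numpy as np

IO, S = 0, 1  # toy vocabulary indices for the indirect object and the subject

def head_outputs(weights, x):
    """Per-head contribution to the logits, shape (n_heads, n_vocab)."""
    return weights @ x

def logit_diff(head_out):
    """Indirect-object logit minus subject logit after summing all heads."""
    logits = head_out.sum(axis=0)
    return float(logits[IO] - logits[S])

def patch_heads(clean_out, corrupted_out, heads):
    """Corrupted run with the listed heads overwritten by their clean outputs."""
    patched = corrupted_out.copy()
    idx = list(heads)
    patched[idx] = clean_out[idx]
    return patched

# Hand-built weights, shape (n_heads=4, n_vocab=3, d_in=2). All values are made up.
weights = np.zeros((4, 3, 2))
weights[0, IO, 0] = 2.0   # head 0 boosts the indirect object when feature 0 is present
weights[1, S, 0] = -1.0   # head 1 suppresses the subject when feature 0 is present
weights[2, 2, 1] = 0.5    # head 2 writes to an unrelated token; head 3 does nothing

clean_out = head_outputs(weights, np.array([1.0, 1.0]))      # task-relevant feature on
corrupted_out = head_outputs(weights, np.array([0.0, 1.0]))  # task-relevant feature off

# Patch one head at a time and record the logit difference it restores.
effects = {
    h: logit_diff(patch_heads(clean_out, corrupted_out, [h])) - logit_diff(corrupted_out)
    for h in range(weights.shape[0])
}
print("clean:", logit_diff(clean_out), "corrupted:", logit_diff(corrupted_out))
print("logit difference restored per patched head:", effects)
```

In this toy, heads 0 and 1 together restore the whole gap while heads 2 and 3 restore none of it, which is the signature a patching sweep looks for. Heads in a real model interact, so their effects need not add up this cleanly.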

Core claim

GPT-2 small performs indirect object identification by routing information through a specific circuit of 26 attention heads organized into seven main classes, located via systematic causal interventions on attention patterns and residual streams, and shown to satisfy quantitative criteria for faithfulness, completeness, and minimality while leaving some explanatory gaps.

What carries the argument

The IOI circuit: a collection of 26 attention heads divided into seven classes, among them name mover, previous token, and induction heads, that together track and select the indirect object.

If this is right

Interventions on the 26 heads can be used to predict and control the model's output on indirect object identification examples.
The same causal-intervention workflow can be applied to reverse-engineer other natural language behaviors inside the same model.
Gaps identified by the completeness and minimality checks indicate specific places where additional heads or mechanisms remain to be explained.
The circuit provides a concrete template for scaling mechanistic explanations to larger models and more complex tasks.
Similar circuits may appear in other transformer models that perform comparable syntactic tracking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Editing or removing heads inside the circuit could allow targeted suppression of the indirect-object behavior without broadly disrupting language modeling.
The approach may transfer to understanding how models handle other syntactic dependencies such as subject-verb agreement or coreference resolution.
If circuits of this size prove common, automated search methods for circuits could become practical for routine interpretability work.
The existence of a compact circuit for this task suggests that many natural behaviors may be implemented by relatively sparse subnetworks rather than diffuse whole-model activity.

Load-bearing premise

The three criteria of faithfulness, completeness, and minimality are enough to certify that the identified set of heads forms the complete and minimal explanation rather than one of several circuits that could achieve similar task performance.
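
One way to write the three criteria down is sketched below. The notation is an assumption introduced here for illustration: F is a scalar task metric (for example, mean logit difference), M is the full model, C is the circuit, and removing a set of heads K means knocking those heads out. The paper's exact definitions should be checked against the original.

```latex
\begin{align*}
\text{Faithfulness:} \quad & |F(M) - F(C)| \ \text{is small}, \\
\text{Completeness:} \quad & |F(C \setminus K) - F(M \setminus K)| \ \text{is small for every } K \subseteq C, \\
\text{Minimality:} \quad & \text{for every } v \in C \ \text{there is } K \subseteq C \setminus \{v\} \
  \text{with } |F(C \setminus (K \cup \{v\})) - F(C \setminus K)| \ \text{large}.
\end{align*}
```

Written this way, the completeness condition quantifies over every subset of the circuit, so in practice it can only be sampled. None of the three conditions compares C against a rival circuit, which is why the premise above carries so much weight.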

What would settle it

Locating a different collection of heads that achieves equal or higher scores on the faithfulness, completeness, and minimality metrics while using fewer heads or different attention patterns would show that the reported circuit is not the minimal explanation.

Original abstract

Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models, or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria--faithfulness, completeness and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, opening opportunities to scale our understanding to both larger models and more complex tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper delivers an end-to-end circuit of 26 heads for indirect object identification in GPT-2 small, which its authors describe as, to their knowledge, the largest such reverse-engineering attempt, backed by causal interventions, though the three criteria do not fully rule out alternative circuits.

The core advance is a concrete circuit of 26 attention heads, split into seven functional classes, that handles indirect object identification in GPT-2 small. The authors located the heads through a mix of activation patching and path patching on a curated dataset, then validated each head's role with interventions that directly alter model behavior rather than just fitting to outputs. This is new because earlier interpretability work either stayed with simple behaviors in small models or gave only coarse descriptions of larger ones. The quantitative checks are straightforward: faithfulness asks whether the circuit on its own performs the task about as well as the full model, completeness asks whether the circuit contains all the components the model uses for the task, and minimality asks whether every head in the circuit is actually needed. Those tests provide real evidence that the identified heads matter. The paper is open about shortfalls, noting that the criteria leave gaps such as possible residual contributions from outside the circuit or higher-order interactions not fully tested. A remaining question is whether a different grouping or superset of heads could match the scores on the same data; the current metrics do not exhaustively compare alternatives. The work is aimed at people doing mechanistic interpretability who want a worked example of scaling circuit discovery to natural language. It is worth sending to peer review because the interventions are direct, the task is non-trivial, and the authors have already flagged the open issues rather than overclaiming closure.

Referee Report

3 major / 2 minor

Summary. The paper claims to reverse-engineer the indirect object identification (IOI) task in GPT-2 small by identifying a circuit of 26 attention heads grouped into 7 classes. The circuit is discovered via causal interventions (activation and path patching) on a curated IOI dataset and validated using three quantitative criteria: faithfulness (the circuit alone performs the task about as well as the full model), completeness (the circuit contains all the components the model uses for the task), and minimality (the circuit contains no heads that are irrelevant to the task). The authors note that the criteria support the explanation but also indicate remaining gaps.

Significance. If the circuit identification holds, the work is significant as what the authors describe as, to their knowledge, the largest end-to-end mechanistic account of a natural behavior in a language model. The reliance on causal interventions rather than correlational methods provides direct evidence for head roles, and the explicit use of quantitative criteria (faithfulness, completeness, minimality) sets a replicable standard for future circuit discovery. This bridges work on simple behaviors in small models and broad descriptions of larger models, supporting the feasibility of scaling mechanistic interpretability.

major comments (3)

[Evaluation section / Abstract] The completeness criterion recovers most accuracy via the 26-head circuit, but the manuscript does not quantify the exact residual error attributable to unpatched components or higher-order interactions outside the circuit (see evaluation section and abstract statement on remaining gaps). This leaves open whether the circuit is complete or merely one sufficient subset.
[Minimality tests (quantitative criteria section)] The minimality criterion shows performance degradation when ablating heads inside the circuit, but does not compare against alternative partitions of heads (including some labeled non-circuit) or test whether other subsets achieve statistically indistinguishable faithfulness and completeness scores. This undermines the claim that the identified circuit is the minimal explanation rather than one of several possible circuits.
[Faithfulness evaluation] Faithfulness is demonstrated by patching the circuit, yet the paper does not report variance across patching orders, dataset subsets, or multiple random seeds, nor does it test whether the performance drop is specific to the discovered circuit versus any comparably sized set of heads.
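
The control requested in the last two comments, comparing the circuit against comparably sized sets of heads, can be sketched as follows. This is a minimal illustration under loud assumptions: the per-head contributions are made-up numbers, the task metric is assumed to be additive across heads (real heads interact), and `kept_score` and `random_baseline` are hypothetical helper names rather than anything from the paper.

```python
import random
import statistics

def kept_score(contrib, kept):
    """Toy task metric when only the `kept` heads are active (additive assumption)."""
    return sum(contrib[h] for h in kept)

def random_baseline(contrib, size, n_samples=200, seed=0):
    """Scores of randomly drawn head subsets of the given size."""
    rng = random.Random(seed)
    heads = list(range(len(contrib)))
    return [kept_score(contrib, rng.sample(heads, size)) for _ in range(n_samples)]

# Made-up contributions: 144 heads, of which a 26-head "circuit" carries the task.
N_HEADS, CIRCUIT_SIZE = 144, 26
circuit = set(range(CIRCUIT_SIZE))
contrib = [1.0 if h in circuit else 0.01 for h in range(N_HEADS)]

full = kept_score(contrib, range(N_HEADS))
circ = kept_score(contrib, circuit)
baseline = random_baseline(contrib, CIRCUIT_SIZE)

print(f"circuit recovers {circ / full:.1%} of the full-model score")
print(f"random subsets recover {statistics.mean(baseline) / full:.1%} "
      f"(sd {statistics.stdev(baseline) / full:.1%})")
```

Reporting the spread of the random baseline alongside the circuit's score is what would let a reader judge whether the circuit's result is specific to it or typical of any 26 heads.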

minor comments (2)

[Circuit diagram figure] The diagram of the 7 head classes would benefit from an accompanying table that explicitly lists each class, its heads, and the functional role assigned to it.
[Notation and methods] Notation for attention heads (e.g., layer and index) should be standardized in a single table early in the paper to aid readability when referring to the 26 heads.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback on our manuscript. We appreciate the recognition of the significance of our work in providing an end-to-end mechanistic account of the IOI task. We have carefully considered each major comment and made revisions to the manuscript to address the concerns about the evaluation criteria. Our responses are detailed below.

Referee: The completeness criterion recovers most accuracy via the 26-head circuit, but the manuscript does not quantify the exact residual error attributable to unpatched components or higher-order interactions outside the circuit (see evaluation section and abstract statement on remaining gaps). This leaves open whether the circuit is complete or merely one sufficient subset.

Authors: We agree that quantifying the residual error more precisely would strengthen the completeness analysis. The abstract already states that the criteria point to remaining gaps, indicating the circuit is sufficient but not necessarily complete. In the revised evaluation section, we have added a detailed analysis of the residual performance, including estimates of contributions from unpatched heads and a discussion of potential higher-order interactions based on further ablation experiments.
revision: yes
Referee: The minimality criterion shows performance degradation when ablating heads inside the circuit, but does not compare against alternative partitions of heads (including some labeled non-circuit) or test whether other subsets achieve statistically indistinguishable faithfulness and completeness scores. This undermines the claim that the identified circuit is the minimal explanation rather than one of several possible circuits.

Authors: We acknowledge the value of comparing to alternative partitions for a stronger minimality claim. However, a full search over all possible subsets of heads is computationally intractable. In the revised minimality tests section, we have included comparisons to random subsets of comparable size and to select alternative groupings of heads. These show that our circuit performs better on the minimality criterion than the alternatives tested, supporting our identification while noting that other viable circuits cannot be ruled out without exhaustive search.
revision: partial
Referee: Faithfulness is demonstrated by patching the circuit, yet the paper does not report variance across patching orders, dataset subsets, or multiple random seeds, nor does it test whether the performance drop is specific to the discovered circuit versus any comparably sized set of heads.

Authors: We thank the referee for this suggestion to improve the robustness of our faithfulness results. The revised manuscript now reports performance metrics averaged over multiple random seeds and across different dataset subsets, including variance measures. Furthermore, we have added experiments comparing the circuit patching to patching random sets of 26 heads, demonstrating that the performance degradation is substantially larger and more consistent for our discovered circuit than for random selections.
revision: yes
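
The intractability point in the second response can be made concrete with a short calculation. GPT-2 small has 12 layers with 12 attention heads each; the sketch below counts head subsets only and ignores MLP layers, which is a simplifying assumption.

```python
import math

N_LAYERS, HEADS_PER_LAYER = 12, 12      # GPT-2 small
N_HEADS = N_LAYERS * HEADS_PER_LAYER    # 144 attention heads in total
CIRCUIT_SIZE = 26

same_size = math.comb(N_HEADS, CIRCUIT_SIZE)  # candidate circuits with exactly 26 heads
all_subsets = 2 ** N_HEADS                    # candidate circuits of any size

# Digit counts give the order of magnitude without printing the full integers.
print(f"subsets of size {CIRCUIT_SIZE}: about 10^{len(str(same_size)) - 1}")
print(f"all subsets of heads: about 10^{len(str(all_subsets)) - 1}")
```

Even with the search restricted to 26-head candidates, the count is far beyond what per-subset patching experiments could cover, so sampled comparisons are the realistic option.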

Circularity Check

0 steps flagged

No circularity: circuit discovered and validated via independent causal interventions

The paper identifies the 26-head IOI circuit through causal interventions (activation and path patching) on GPT-2 small activations, then validates it on a curated dataset with faithfulness (the circuit alone performs about as well as the full model), completeness (the circuit contains the components the model uses for the task), and minimality (the circuit contains no irrelevant heads). These steps are empirical measurements of the model's own behavior rather than any fitted parameter or self-referential definition. No equations reduce a 'prediction' to an input by construction, no uniqueness theorem is imported from self-citations, and no ansatz or renaming occurs. The abstract explicitly notes remaining gaps, confirming that the criteria support the result but do not tautologically define it. The argument therefore rests on direct measurements of the model rather than on quantities defined in terms of the conclusion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that causal interventions isolate functional roles without side effects on unrelated computations and that the three evaluation metrics are adequate proxies for a complete mechanistic explanation.

axioms (1)

Domain assumption: causal interventions on attention heads reveal their functional roles in the computation.
Invoked throughout the intervention experiments described in the abstract.

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dissecting Jet-Tagger Through Mechanistic Interpretability
hep-ph 2026-05 accept novelty 8.0

A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.
Progress measures for grokking via mechanistic interpretability
cs.LG 2023-01 accept novelty 8.0

Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
GKnow: Measuring the Entanglement of Gender Bias and Factual Gender
cs.CL 2026-05 unverdicted novelty 7.0

Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
cs.CL 2026-05 unverdicted novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
Cell-Based Representation of Relational Binding in Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...
Grokking of Diffusion Models: Case Study on Modular Addition
cs.LG 2026-04 unverdicted novelty 7.0

Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
CURE:Circuit-Aware Unlearning for LLM-based Recommendation
cs.IR 2026-04 unverdicted novelty 7.0

CURE disentangles LLM recommendation circuits into forget-specific, retain-specific, and task-shared modules with tailored update rules to achieve more effective unlearning than weighted baselines.
Eliciting Latent Predictions from Transformers with the Tuned Lens
cs.LG 2023-03 accept novelty 7.0

Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
How to Interpret Agent Behavior
cs.AI 2026-05 conditional novelty 6.0

ACT*ONOMY is a Grounded-Theory-derived hierarchical taxonomy and open repository that enables systematic comparison and characterization of autonomous agent behavior across trajectories.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
cs.LG 2026-05 unverdicted novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Not How Many, But Which: Parameter Placement in Low-Rank Adaptation
cs.LG 2026-05 unverdicted novelty 6.0

Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
cs.AI 2026-05 unverdicted novelty 6.0

Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.
Architecture, Not Scale: Circuit Localization in Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.
Tool Calling is Linearly Readable and Steerable in Language Models
cs.CL 2026-05 unverdicted novelty 6.0

Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions
cs.LG 2026-05 unverdicted novelty 6.0

Future-rhyme information is linearly decodable at line boundaries across model families and strengthens with scale, yet only Gemma-3-27B causally depends on it, with the driver migrating to the boundary around layer 3...
Hallucination Detection via Activations of Open-Weight Proxy Analyzers
cs.CL 2026-05 unverdicted novelty 6.0

A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.
The Position Curse: LLMs Struggle to Locate the Last Few Items in a List
cs.LG 2026-05 unverdicted novelty 6.0

LLMs exhibit the Position Curse, with backward position retrieval in lists lagging far behind forward retrieval, showing only partial gains from PosBench fine-tuning.
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
cs.CL 2026-05 unverdicted novelty 6.0

LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
cs.CL 2026-04 unverdicted novelty 6.0

Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...
The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
cs.LG 2026-04 unverdicted novelty 6.0

FP16 KV caching in transformers causes deterministic token divergence versus cache-free inference due to non-associative floating-point accumulation orderings.
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
cs.LG 2026-04 unverdicted novelty 6.0

Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.
PhiNet: Speaker Verification with Phonetic Interpretability
eess.AS 2026-04 unverdicted novelty 6.0

PhiNet adds phonetic interpretability to speaker verification while matching the accuracy of standard black-box models on VoxCeleb, SITW, and LibriSpeech.
Negative Before Positive: Asymmetric Valence Processing in Large Language Models
cs.CL 2026-05 unverdicted novelty 5.0

Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.
Graph Memory Transformer (GMT)
cs.LG 2026-04 unverdicted novelty 5.0

Graph Memory Transformer (GMT) swaps dense FFN sublayers for a graph of 128 centroids and a learned 128x128 transition matrix per block, yielding a 82M-parameter decoder-only LM that trains stably but trails a 103M de...
Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs
cs.CL 2026-04 unverdicted novelty 5.0

HETA is a new attribution framework for decoder-only LLMs that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to produce more faithful and human-aligned token attributions th...
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
cs.LG 2024-08 accept novelty 4.0

Gemma Scope supplies trained sparse autoencoders for all layers of Gemma 2 2B and 9B plus select 27B layers, with public weights and benchmark scores.
Speaking of Language: Reflections on Metalanguage Research in NLP
cs.CL 2026-04 unverdicted novelty 3.0

This reflection paper highlights metalanguage in NLP, links it to LLMs, and lists understudied future directions.
High-Dimensional Statistics: Reflections on Progress and Open Problems
math.ST 2026-05 unverdicted novelty 2.0

A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 28 Pith papers · 3 internal anchors

[3]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

work page 1901
[4]

A literature survey of recent advances in chatbots

Guendalina Caldarini, Sardar Jaf, and Kenneth McGarry. A literature survey of recent advances in chatbots. Information, 13 0 (1): 0 41, 2022

work page 2022
[5]

A mathematical framework for transformer circuits

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

work page 2021
[6]

Causal abstractions of neural networks

Atticus Geiger, Hanson Lu, Thomas F Icard, and Christopher Potts. Causal abstractions of neural networks. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=RmuXDtjDhG

work page 2021
[8]

X-risk analysis for ai research

Dan Hendrycks and Mantas Mazeika. X-risk analysis for ai research. arXiv, abs/2206.05862, 2022

work page arXiv 2022
[9]

Natural language descriptions of deep visual features

Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In International Conference on Learning Representations, 2021

work page 2021
[12]

Are sixteen heads really better than one? Advances in neural information processing systems, 32, 2019

Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? Advances in neural information processing systems, 32, 2019

work page 2019
[13]

Compositional explanations of neurons

Jesse Mu and Jacob Andreas. Compositional explanations of neurons. Advances in Neural Information Processing Systems, 33: 0 17153--17163, 2020

work page 2020
[14]

A mechanistic interpretability analysis of grokking, 2022

Neel Nanda and Tom Lieberum. A mechanistic interpretability analysis of grokking, 2022. URL https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking

work page 2022
[15]

Mechanistic interpretability, variables, and the importance of interpretable bases

Chris Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. https://www.transformer-circuits.pub/2022/mech-interp-essay, 2022. Accessed: 2022-15-09

work page 2022
[16]

Zoom in: An introduction to circuits

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi:10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in

work page doi:10.23915/distill.00024.001 2020
[18]

Language models are unsupervised multitask learners

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

work page 2019
[19]

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks , shorttitle =

Tilman Räuker, Anson Ho, Stephen Casper, and Dylan Hadfield-Menell. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks, 2022. URL https://arxiv.org/abs/2207.13243

work page arXiv 2022
[20]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp.\ 5998--6008, 2017

work page 2017
[21]

Investigating gender bias in language models using causal mediation analysis

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems, 33: 0 12388--12401, 2020

work page 2020
[22]

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. ArXiv, abs/2206.07682, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

Shifting machine learning for healthcare from development to deployment and from models to data

Angela Zhang, Lei Xing, James Zou, and Joseph C Wu. Shifting machine learning for healthcare from development to deployment and from models to data. Nature Biomedical Engineering, pp.\ 1--16, 2022

work page 2022
[24]

arXiv preprint arXiv:2106.06087 , year=

Finlayson, Matthew and Mueller, Aaron and Gehrmann, Sebastian and Shieber, Stuart and Linzen, Tal and Belinkov, Yonatan , keywords =. Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models , publisher =. 2021 , copyright =. doi:10.48550/ARXIV.2106.06087 , url =

work page doi:10.48550/arxiv.2106.06087 2021
[25]

BERT Rediscovers the Classical NLP Pipeline , publisher =

Tenney, Ian and Das, Dipanjan and Pavlick, Ellie , keywords =. BERT Rediscovers the Classical NLP Pipeline , publisher =. 2019 , copyright =. doi:10.48550/ARXIV.1905.05950 , url =

work page doi:10.48550/arxiv.1905.05950 2019
[26]

2019 , subtitle =

Interpretable Machine Learning , author =. 2019 , subtitle =

work page 2019
[27]

Learning to Generate Reviews and Discovering Sentiment

Radford, Alec and Jozefowicz, Rafal and Sutskever, Ilya , keywords =. Learning to Generate Reviews and Discovering Sentiment , publisher =. 2017 , copyright =. doi:10.48550/ARXIV.1704.01444 , url =

work page Pith review doi:10.48550/arxiv.1704.01444 2017
[28]

arXiv preprint arXiv:2106.00737 , year=

Li, Belinda Z. and Nye, Maxwell and Andreas, Jacob , keywords =. Implicit Representations of Meaning in Neural Language Models , publisher =. 2021 , copyright =. doi:10.48550/ARXIV.2106.00737 , url =

work page doi:10.48550/arxiv.2106.00737 2021
[29]

Tolga Bolukbasi and Adam Pearce and Ann Yuan and Andy Coenen and Emily Reif and Fernanda B. Vi. An Interpretability Illusion for. CoRR , volume =. 2021 , url =. 2104.07143 , timestamp =

work page arXiv 2021
[30]

BERT Rediscovers the Classical NLP Pipeline

Tenney, Ian and Das, Dipanjan and Pavlick, Ellie. BERT Rediscovers the Classical NLP Pipeline. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1452

work page doi:10.18653/v1/p19-1452 2019
[31]

International Conference on Machine Learning , pages=

Inductive biases and variable creation in self-attention mechanisms , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022
[32]

arXiv preprint arXiv:2206.04301 , year=

Unveiling Transformers with LEGO: a synthetic reasoning task , author=. arXiv preprint arXiv:2206.04301 , year=

work page arXiv
[33]

Hidden progress in deep learning: Sgd learns parities near the computational limit

Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit , author=. arXiv preprint arXiv:2207.08799 , year=

work page arXiv
[34]

arXiv , year=

X-Risk Analysis for AI Research , author=. arXiv , year=

work page
[35]

In-context Learning and Induction Heads

In-context learning and induction heads , author=. arXiv preprint arXiv:2209.11895 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Information , volume=

A literature survey of recent advances in chatbots , author=. Information , volume=. 2022 , publisher=

work page 2022
[37]

Nature Biomedical Engineering , pages=

Shifting machine learning for healthcare from development to deployment and from models to data , author=. Nature Biomedical Engineering , pages=. 2022 , publisher=

work page 2022
[38]

ArXiv , year=

Emergent Abilities of Large Language Models , author=. ArXiv , year=

work page
[39]

Unsolved Problems in ML Safety. arXiv preprint arXiv:2109.13916.
[40]

Natural Language Descriptions of Deep Visual Features. International Conference on Learning Representations.
[41]

Compositional explanations of neurons. Advances in Neural Information Processing Systems.
[42]

Implicit representations of meaning in neural language models. arXiv preprint arXiv:2106.00737.
[43]

Toward a visual concept vocabulary for GAN latent space. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[44]

Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems.
[45]

On the pitfalls of analyzing individual neurons in language models. arXiv preprint arXiv:2110.07483.
[46]

Causal analysis of syntactic agreement mechanisms in neural language models. arXiv preprint arXiv:2106.06087.
[47]

On the Opportunities and Risks of Foundation Models. arXiv.
[48]

Are sixteen heads really better than one? Advances in Neural Information Processing Systems.
[49]

Tilman Räuker, Anson Ho, Stephen Casper, and Dylan Hadfield-Menell. Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. 2022. doi:10.48550/ARXIV.2207.13243
[50]

Analyzing Transformers in Embedding Space. arXiv preprint arXiv:2209.02535.
[51]

Transformer Feed-Forward Layers Are Key-Value Memories. arXiv preprint arXiv:2012.14913.
[52]

Locating and Editing Factual Associations in GPT. arXiv preprint arXiv:2202.05262, January 2023.
[53]

Pascal Sturmfels, Scott Lundberg, and Su-In Lee. Distill.
[54]

Yoshua Bengio and Yann LeCun. Scaling Learning Algorithms Towards...
[55]

Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A Fast Learning Algorithm for Deep Belief Nets.
[56]

Deep Learning. 2016.
[57]

Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases.
[58]

A Mathematical Framework for Transformer Circuits. 2021.
[59]

A transparency and interpretability tech tree.
[60]

Causal Abstractions of Neural Networks. Advances in Neural Information Processing Systems, 2021.
[61]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Kaiser, ... Advances in Neural Information Processing Systems.
[62]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Winte... Language Models are Few-Shot Learners.
[63]

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Distill.
[64]

Language Models are Unsupervised Multitask Learners.
[65]

Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity Checks for Saliency Maps.
[66]

Neel Nanda. GitHub repository, 2022.
[67]

Neel Nanda and Tom Lieberum.
[68]

nostalgebraist.
[69]

Evan Hubinger. An overview of 11 proposals for building safe advanced AI. December 2020. doi:10.48550/ARXIV.2012.07532
[70]

Sarthak Jain and Byron C. Wallace. Attention is not Explanation. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019. doi:10.18653/v1/N19-1357
[71]

Yann LeCun, John Denker, and Sara Solla. Optimal Brain Damage.
[72]

Softmax Linear Units. 2022.