pith. machine review for the scientific record.

arxiv: 2211.00593 · v1 · submitted 2022-11-01 · 💻 cs.LG · cs.AI · cs.CL

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

Pith reviewed 2026-05-13 17:08 UTC · model grok-4.3

keywords mechanistic interpretability · attention circuits · GPT-2 · indirect object identification · causal interventions · transformer models · language model behavior · reverse engineering

The pith

GPT-2 small solves indirect object identification using a circuit of 26 attention heads in seven classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper maps the internal mechanism by which GPT-2 small identifies the indirect object in a sentence, such as determining who received an item in a sentence about one person giving something to another. The authors isolate 26 specific attention heads, grouped into seven functional classes, that together carry the necessary information flow. They locate these heads with targeted causal interventions that ablate or patch model components and measure the resulting change in task performance. The explanation is then checked against three quantitative tests: how faithfully the circuit reproduces the original behavior, how completely it accounts for performance, and how minimal the selected set of heads is.
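
One way to make the three tests concrete is sketched below. The notation is an assumption introduced here for illustration: $M$ is the full model, $C$ the circuit, $F(\cdot)$ a task metric such as the mean logit difference between the indirect object and the subject, and $X \setminus K$ means the heads in $K$ are knocked out of $X$. The paper's own definitions should be treated as authoritative.

```latex
\begin{align*}
\text{Faithfulness:} \quad & |F(M) - F(C)| \text{ is small} \\
\text{Completeness:} \quad & |F(C \setminus K) - F(M \setminus K)| \text{ is small for every } K \subseteq C \\
\text{Minimality:} \quad & \text{for every } v \in C \text{ there is some } K \subseteq C \setminus \{v\}
  \text{ with } |F(C \setminus (K \cup \{v\})) - F(C \setminus K)| \text{ large}
\end{align*}
```

Read this way, faithfulness compares the circuit with the full model, completeness asks that the two stay matched under knockouts, and minimality asks that every head matters in at least one context.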

Core claim

GPT-2 small performs indirect object identification by routing information through a specific circuit of 26 attention heads organized into seven main classes, located via systematic causal interventions on attention patterns and residual streams, and shown to satisfy quantitative criteria for faithfulness, completeness, and minimality while leaving some explanatory gaps.
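
A minimal sketch of what a causal intervention on a head can look like, assuming mean ablation (replacing a head's contribution with its average over a reference distribution) and a toy additive "model". The head names and all numbers are hypothetical and chosen only for illustration; nothing here is taken from GPT-2 small.

```python
from statistics import mean

# Toy stand-in for a model: each "head" adds a per-example contribution to
# the logit difference logit(indirect object) - logit(subject).
# All values are made up for illustration.
HEAD_CONTRIBUTIONS = {
    "head_a": [2.0, 1.8, 2.2],
    "head_b": [0.1, -0.1, 0.0],
}
# Contributions of the same hypothetical heads on a reference distribution.
REFERENCE_CONTRIBUTIONS = {
    "head_a": [0.2, 0.0, 0.1],
    "head_b": [0.0, 0.1, -0.1],
}

def logit_diff(ablated=()):
    """Mean logit difference, with the heads in `ablated` mean-ablated."""
    n_examples = len(next(iter(HEAD_CONTRIBUTIONS.values())))
    per_example = []
    for i in range(n_examples):
        total = 0.0
        for head, values in HEAD_CONTRIBUTIONS.items():
            if head in ablated:
                # Replace the head's output with its reference-set average.
                total += mean(REFERENCE_CONTRIBUTIONS[head])
            else:
                total += values[i]
        per_example.append(total)
    return mean(per_example)

baseline = logit_diff()
without_a = logit_diff(ablated={"head_a"})
without_b = logit_diff(ablated={"head_b"})
```

In this toy setup, ablating `head_a` removes most of the logit difference while ablating `head_b` leaves it nearly unchanged, which is the kind of contrast an intervention study looks for when deciding whether a head belongs in a circuit.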

What carries the argument

The IOI circuit: a collection of 26 attention heads divided into seven classes, among them name mover, previous token, and induction heads, that together track and select the indirect object.

If this is right

  • Interventions on the 26 heads can be used to predict and control the model's output on indirect object identification examples.
  • The same causal-intervention workflow can be applied to reverse-engineer other natural language behaviors inside the same model.
  • Gaps identified by the completeness and minimality checks indicate specific places where additional heads or mechanisms remain to be explained.
  • The circuit provides a concrete template for scaling mechanistic explanations to larger models and more complex tasks.
  • Similar circuits may appear in other transformer models that perform comparable syntactic tracking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Editing or removing heads inside the circuit could allow targeted suppression of the indirect-object behavior without broadly disrupting language modeling.
  • The approach may transfer to understanding how models handle other syntactic dependencies such as subject-verb agreement or coreference resolution.
  • If circuits of this size prove common, automated search methods for circuits could become practical for routine interpretability work.
  • The existence of a compact circuit for this task suggests that many natural behaviors may be implemented by relatively sparse subnetworks rather than diffuse whole-model activity.

Load-bearing premise

The three criteria of faithfulness, completeness, and minimality are enough to certify that the identified set of heads forms the complete and minimal explanation rather than one of several circuits that could achieve similar task performance.

What would settle it

Locating a different collection of heads that achieves equal or higher scores on the faithfulness, completeness, and minimality metrics while using fewer heads or different attention patterns would show that the reported circuit is not the minimal explanation.
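
The test described above can be written as a simple comparison, sketched here under stated assumptions: each criterion is summarized as a single number where higher is better, and `CircuitScore` and `undercuts` are hypothetical names introduced for illustration. The "different attention patterns" branch is not captured by this check.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CircuitScore:
    n_heads: int
    faithfulness: float   # higher is better in this sketch
    completeness: float
    minimality: float

def undercuts(candidate: CircuitScore, reported: CircuitScore, tol: float = 0.0) -> bool:
    """True if `candidate` matches or beats `reported` on all three scores
    (within `tol`) while using strictly fewer heads."""
    scores_ok = (
        candidate.faithfulness >= reported.faithfulness - tol
        and candidate.completeness >= reported.completeness - tol
        and candidate.minimality >= reported.minimality - tol
    )
    return scores_ok and candidate.n_heads < reported.n_heads

# Made-up scores, for illustration only.
reported = CircuitScore(n_heads=26, faithfulness=0.9, completeness=0.8, minimality=0.7)
smaller_and_as_good = CircuitScore(n_heads=20, faithfulness=0.9, completeness=0.85, minimality=0.7)
smaller_but_worse = CircuitScore(n_heads=20, faithfulness=0.5, completeness=0.8, minimality=0.7)
```

A candidate for which `undercuts(candidate, reported)` holds would be evidence that the reported circuit is not the minimal explanation.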

Original abstract

Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models, or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria--faithfulness, completeness and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, opening opportunities to scale our understanding to both larger models and more complex tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to reverse-engineer the indirect object identification (IOI) task in GPT-2 small by identifying a circuit of 26 attention heads grouped into 7 classes. The circuit is discovered via causal interventions (activation and path patching) on a curated IOI dataset and validated using three quantitative criteria: faithfulness (circuit patching degrades performance), completeness (circuit alone recovers most accuracy), and minimality (ablating any head inside the circuit degrades results). The authors note that the criteria support the explanation but also indicate remaining gaps.

Significance. If the circuit identification holds, the work is significant as the largest end-to-end mechanistic account of a natural language behavior in a transformer. The reliance on causal interventions rather than correlational methods provides direct evidence for head roles, and the explicit use of quantitative criteria (faithfulness, completeness, minimality) sets a replicable standard for future circuit discovery. This bridges small-model toy tasks and broad descriptions of larger models, supporting the feasibility of scaling mechanistic interpretability.

major comments (3)
  1. [Evaluation section / Abstract] The completeness criterion recovers most accuracy via the 26-head circuit, but the manuscript does not quantify the exact residual error attributable to unpatched components or higher-order interactions outside the circuit (see evaluation section and abstract statement on remaining gaps). This leaves open whether the circuit is complete or merely one sufficient subset.
  2. [Minimality tests (quantitative criteria section)] The minimality criterion shows performance degradation when ablating heads inside the circuit, but does not compare against alternative partitions of heads (including some labeled non-circuit) or test whether other subsets achieve statistically indistinguishable faithfulness and completeness scores. This undermines the claim that the identified circuit is the minimal explanation rather than one of several possible circuits.
  3. [Faithfulness evaluation] Faithfulness is demonstrated by patching the circuit, yet the paper does not report variance across patching orders, dataset subsets, or multiple random seeds, nor does it test whether the performance drop is specific to the discovered circuit versus any comparably sized set of heads.
minor comments (2)
  1. [Circuit diagram figure] The diagram of the 7 head classes would benefit from an accompanying table that explicitly lists each class, its heads, and the functional role assigned to it.
  2. [Notation and methods] Notation for attention heads (e.g., layer and index) should be standardized in a single table early in the paper to aid readability when referring to the 26 heads.
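
The table requested in the minor comments could be backed by a small data structure like the one sketched below. It assumes a `layer.head` label convention; `Head`, `parse_head`, `render_table`, the class names, and the head entries are all hypothetical placeholders and are not the paper's actual heads or classes.

```python
from typing import NamedTuple

class Head(NamedTuple):
    layer: int
    head: int

    def label(self) -> str:
        """Render as 'layer.head', e.g. Head(9, 6) -> '9.6'."""
        return f"{self.layer}.{self.head}"

def parse_head(label: str) -> Head:
    """Parse a 'layer.head' label back into a Head."""
    layer, head = label.split(".")
    return Head(int(layer), int(head))

# Placeholder entries only: NOT the paper's head assignments.
HEAD_TABLE = {
    "class_a": [Head(0, 1), Head(3, 0)],
    "class_b": [Head(5, 5)],
}

def render_table(table) -> str:
    """Render one 'class | heads' row per class."""
    lines = ["class | heads"]
    for name, heads in table.items():
        lines.append(f"{name} | " + ", ".join(h.label() for h in heads))
    return "\n".join(lines)
```

Keeping one canonical label format and generating the table from it avoids the mixed notation the referee objects to.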

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback on our manuscript. We appreciate the recognition of the significance of our work in providing an end-to-end mechanistic account of the IOI task. We have carefully considered each major comment and made revisions to the manuscript to address the concerns about the evaluation criteria. Our responses are detailed below.

Point-by-point responses
  1. Referee: The completeness criterion recovers most accuracy via the 26-head circuit, but the manuscript does not quantify the exact residual error attributable to unpatched components or higher-order interactions outside the circuit (see evaluation section and abstract statement on remaining gaps). This leaves open whether the circuit is complete or merely one sufficient subset.

    Authors: We agree that quantifying the residual error more precisely would strengthen the completeness analysis. The abstract already states that the criteria point to remaining gaps, indicating the circuit is sufficient but not necessarily complete. In the revised evaluation section, we have added a detailed analysis of the residual performance, including estimates of contributions from unpatched heads and a discussion of potential higher-order interactions based on further ablation experiments. revision: yes

  2. Referee: The minimality criterion shows performance degradation when ablating heads inside the circuit, but does not compare against alternative partitions of heads (including some labeled non-circuit) or test whether other subsets achieve statistically indistinguishable faithfulness and completeness scores. This undermines the claim that the identified circuit is the minimal explanation rather than one of several possible circuits.

    Authors: We acknowledge the value of comparing to alternative partitions for a stronger minimality claim. However, a full search over all possible subsets of heads is computationally intractable. In the revised minimality tests section, we have included comparisons to random subsets of comparable size and to select alternative groupings of heads. These show that our circuit performs better on the minimality criterion than the alternatives tested, supporting our identification while noting that other viable circuits cannot be ruled out without exhaustive search. revision: partial

  3. Referee: Faithfulness is demonstrated by patching the circuit, yet the paper does not report variance across patching orders, dataset subsets, or multiple random seeds, nor does it test whether the performance drop is specific to the discovered circuit versus any comparably sized set of heads.

    Authors: We thank the referee for this suggestion to improve the robustness of our faithfulness results. The revised manuscript now reports performance metrics averaged over multiple random seeds and across different dataset subsets, including variance measures. Furthermore, we have added experiments comparing the circuit patching to patching random sets of 26 heads, demonstrating that the performance degradation is substantially larger and more consistent for our discovered circuit than for random selections. revision: yes
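
The random-subset control discussed in the responses above can be sketched as follows. This is an illustration under stated assumptions, not the authors' experiment: `toy_effect` is a made-up stand-in for "drop in task metric when a set of heads is ablated", and `PLACEHOLDER_CIRCUIT` is an arbitrary set of 26 heads, not the paper's circuit. The only model fact used is that GPT-2 small has 12 layers of 12 attention heads.

```python
import math
import random
from statistics import mean, stdev

N_LAYERS, N_HEADS = 12, 12  # GPT-2 small: 12 layers x 12 heads = 144 heads
ALL_HEADS = [(layer, head) for layer in range(N_LAYERS) for head in range(N_HEADS)]

# Hypothetical placeholder circuit: NOT the paper's 26 heads.
PLACEHOLDER_CIRCUIT = ALL_HEADS[:26]

def toy_effect(heads):
    """Made-up stand-in for the metric drop when `heads` are ablated:
    counts how many ablated heads fall inside the placeholder circuit."""
    circuit = set(PLACEHOLDER_CIRCUIT)
    return float(sum(1 for h in heads if h in circuit))

def random_baseline(size, effect_fn, n_seeds=20):
    """Mean and standard deviation of the effect over random head sets."""
    effects = []
    for seed in range(n_seeds):
        rng = random.Random(seed)  # one reproducible draw per seed
        effects.append(effect_fn(rng.sample(ALL_HEADS, size)))
    return mean(effects), stdev(effects)

circuit_effect = toy_effect(PLACEHOLDER_CIRCUIT)
base_mean, base_sd = random_baseline(len(PLACEHOLDER_CIRCUIT), toy_effect)

# Number of distinct 26-head subsets of 144 heads, which illustrates why
# an exhaustive search over alternatives is out of reach.
n_subsets_of_26 = math.comb(len(ALL_HEADS), 26)
```

Reporting `circuit_effect` next to `base_mean` and `base_sd` shows whether the drop is specific to the chosen heads, and the size of `n_subsets_of_26` is the reason such comparisons have to rely on sampling rather than enumeration.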

Circularity Check

0 steps flagged

No circularity: circuit discovered and validated via independent causal interventions

full rationale

The paper identifies the 26-head IOI circuit through causal interventions (activation and path patching) on GPT-2 small activations, then validates it with faithfulness (patching degrades performance), completeness (circuit recovers accuracy), and minimality (ablating heads inside the circuit hurts results) on a curated dataset. These steps are empirical measurements of the model's own behavior rather than fitted parameters or self-referential definitions. No equation reduces a 'prediction' to an input by construction, no uniqueness theorem is imported from self-citations, and no ansatz or renaming occurs. The abstract explicitly notes remaining gaps, confirming that the criteria support the result without tautologically defining it. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that causal interventions isolate functional roles without side effects on unrelated computations and that the three evaluation metrics are adequate proxies for a complete mechanistic explanation.

axioms (1)
  • domain assumption: Causal interventions on attention heads reveal their functional roles in the computation
    Invoked throughout the intervention experiments described in the abstract

pith-pipeline@v0.9.0 · 5492 in / 1228 out tokens · 31043 ms · 2026-05-13T17:08:49.885716+00:00 · methodology

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dissecting Jet-Tagger Through Mechanistic Interpretability

    hep-ph 2026-05 accept novelty 8.0

    A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.

  2. Progress measures for grokking via mechanistic interpretability

    cs.LG 2023-01 accept novelty 8.0

    Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.

  3. GKnow: Measuring the Entanglement of Gender Bias and Factual Gender

    cs.CL 2026-05 unverdicted novelty 7.0

    Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.

  4. Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

    cs.CL 2026-05 unverdicted novelty 7.0

    Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

  5. Cell-Based Representation of Relational Binding in Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...

  6. Grokking of Diffusion Models: Case Study on Modular Addition

    cs.LG 2026-04 unverdicted novelty 7.0

    Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

  7. CURE: Circuit-Aware Unlearning for LLM-based Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    CURE disentangles LLM recommendation circuits into forget-specific, retain-specific, and task-shared modules with tailored update rules to achieve more effective unlearning than weighted baselines.

  8. Eliciting Latent Predictions from Transformers with the Tuned Lens

    cs.LG 2023-03 accept novelty 7.0

    Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

  9. How to Interpret Agent Behavior

    cs.AI 2026-05 conditional novelty 6.0

    ACT*ONOMY is a Grounded-Theory-derived hierarchical taxonomy and open repository that enables systematic comparison and characterization of autonomous agent behavior across trajectories.

  10. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  11. Not How Many, But Which: Parameter Placement in Low-Rank Adaptation

    cs.LG 2026-05 unverdicted novelty 6.0

    Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.

  12. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.

  13. Architecture, Not Scale: Circuit Localization in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.

  14. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  15. Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

    cs.LG 2026-05 unverdicted novelty 6.0

    Future-rhyme information is linearly decodable at line boundaries across model families and strengthens with scale, yet only Gemma-3-27B causally depends on it, with the driver migrating to the boundary around layer 3...

  16. Hallucination Detection via Activations of Open-Weight Proxy Analyzers

    cs.CL 2026-05 unverdicted novelty 6.0

    A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.

  17. The Position Curse: LLMs Struggle to Locate the Last Few Items in a List

    cs.LG 2026-05 unverdicted novelty 6.0

    LLMs exhibit the Position Curse, with backward position retrieval in lists lagging far behind forward retrieval, showing only partial gains from PosBench fine-tuning.

  18. When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.

  19. Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...

  20. The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    FP16 KV caching in transformers causes deterministic token divergence versus cache-free inference due to non-associative floating-point accumulation orderings.

  21. Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

    cs.LG 2026-04 unverdicted novelty 6.0

    Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.

  22. PhiNet: Speaker Verification with Phonetic Interpretability

    eess.AS 2026-04 unverdicted novelty 6.0

    PhiNet adds phonetic interpretability to speaker verification while matching the accuracy of standard black-box models on VoxCeleb, SITW, and LibriSpeech.

  23. Negative Before Positive: Asymmetric Valence Processing in Large Language Models

    cs.CL 2026-05 unverdicted novelty 5.0

    Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.

  24. Graph Memory Transformer (GMT)

    cs.LG 2026-04 unverdicted novelty 5.0

    Graph Memory Transformer (GMT) swaps dense FFN sublayers for a graph of 128 centroids and a learned 128x128 transition matrix per block, yielding an 82M-parameter decoder-only LM that trains stably but trails a 103M de...

  25. Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    HETA is a new attribution framework for decoder-only LLMs that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to produce more faithful and human-aligned token attributions th...

  26. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

    cs.LG 2024-08 accept novelty 4.0

    Gemma Scope supplies trained sparse autoencoders for all layers of Gemma 2 2B and 9B plus select 27B layers, with public weights and benchmark scores.

  27. Speaking of Language: Reflections on Metalanguage Research in NLP

    cs.CL 2026-04 unverdicted novelty 3.0

    This reflection paper highlights metalanguage in NLP, links it to LLMs, and lists understudied future directions.

  28. High-Dimensional Statistics: Reflections on Progress and Open Problems

    math.ST 2026-05 unverdicted novelty 2.0

    A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 28 Pith papers · 3 internal anchors

  1. [3]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  2. [4]

    A literature survey of recent advances in chatbots

    Guendalina Caldarini, Sardar Jaf, and Kenneth McGarry. A literature survey of recent advances in chatbots. Information, 13 0 (1): 0 41, 2022

  3. [5]

    A mathematical framework for transformer circuits

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

  4. [6]

    Causal abstractions of neural networks

    Atticus Geiger, Hanson Lu, Thomas F Icard, and Christopher Potts. Causal abstractions of neural networks. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=RmuXDtjDhG

  5. [8]

    X-risk analysis for ai research

    Dan Hendrycks and Mantas Mazeika. X-risk analysis for ai research. arXiv, abs/2206.05862, 2022

  6. [9]

    Natural language descriptions of deep visual features

    Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In International Conference on Learning Representations, 2021

  7. [12]

    Are sixteen heads really better than one? Advances in neural information processing systems, 32, 2019

    Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? Advances in neural information processing systems, 32, 2019

  8. [13]

    Compositional explanations of neurons

    Jesse Mu and Jacob Andreas. Compositional explanations of neurons. Advances in Neural Information Processing Systems, 33: 0 17153--17163, 2020

  9. [14]

    A mechanistic interpretability analysis of grokking, 2022

    Neel Nanda and Tom Lieberum. A mechanistic interpretability analysis of grokking, 2022. URL https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking

  10. [15]

    Mechanistic interpretability, variables, and the importance of interpretable bases

    Chris Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. https://www.transformer-circuits.pub/2022/mech-interp-essay, 2022. Accessed: 2022-15-09

  11. [16]

    Zoom in: An introduction to circuits

    Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi:10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in

  12. [18]

    Language models are unsupervised multitask learners

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

  13. [19]

    Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks , shorttitle =

    Tilman Räuker, Anson Ho, Stephen Casper, and Dylan Hadfield-Menell. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks, 2022. URL https://arxiv.org/abs/2207.13243

  14. [20]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp.\ 5998--6008, 2017

  15. [21]

    Investigating gender bias in language models using causal mediation analysis

    Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems, 33: 0 12388--12401, 2020

  16. [22]

    Emergent Abilities of Large Language Models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. ArXiv, abs/2206.07682, 2022

  17. [23]

    Shifting machine learning for healthcare from development to deployment and from models to data

    Angela Zhang, Lei Xing, James Zou, and Joseph C Wu. Shifting machine learning for healthcare from development to deployment and from models to data. Nature Biomedical Engineering, pp.\ 1--16, 2022

  18. [24]

    arXiv preprint arXiv:2106.06087 , year=

    Finlayson, Matthew and Mueller, Aaron and Gehrmann, Sebastian and Shieber, Stuart and Linzen, Tal and Belinkov, Yonatan , keywords =. Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models , publisher =. 2021 , copyright =. doi:10.48550/ARXIV.2106.06087 , url =

  19. [25]

    BERT Rediscovers the Classical NLP Pipeline , publisher =

    Tenney, Ian and Das, Dipanjan and Pavlick, Ellie , keywords =. BERT Rediscovers the Classical NLP Pipeline , publisher =. 2019 , copyright =. doi:10.48550/ARXIV.1905.05950 , url =

  20. [26]

    2019 , subtitle =

    Interpretable Machine Learning , author =. 2019 , subtitle =

  21. [27]

    Learning to Generate Reviews and Discovering Sentiment

    Radford, Alec and Jozefowicz, Rafal and Sutskever, Ilya , keywords =. Learning to Generate Reviews and Discovering Sentiment , publisher =. 2017 , copyright =. doi:10.48550/ARXIV.1704.01444 , url =

  22. [28]

    arXiv preprint arXiv:2106.00737 , year=

    Li, Belinda Z. and Nye, Maxwell and Andreas, Jacob , keywords =. Implicit Representations of Meaning in Neural Language Models , publisher =. 2021 , copyright =. doi:10.48550/ARXIV.2106.00737 , url =

  23. [29]

    Tolga Bolukbasi and Adam Pearce and Ann Yuan and Andy Coenen and Emily Reif and Fernanda B. Vi. An Interpretability Illusion for. CoRR , volume =. 2021 , url =. 2104.07143 , timestamp =

  24. [30]

    BERT Rediscovers the Classical NLP Pipeline

    Tenney, Ian and Das, Dipanjan and Pavlick, Ellie. BERT Rediscovers the Classical NLP Pipeline. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1452

  25. [31]

    International Conference on Machine Learning , pages=

    Inductive biases and variable creation in self-attention mechanisms , author=. International Conference on Machine Learning , pages=. 2022 , organization=

  26. [32]

    arXiv preprint arXiv:2206.04301 , year=

    Unveiling Transformers with LEGO: a synthetic reasoning task , author=. arXiv preprint arXiv:2206.04301 , year=

  27. [33]

    Hidden progress in deep learning: Sgd learns parities near the computational limit

    Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit , author=. arXiv preprint arXiv:2207.08799 , year=

  28. [34]

    arXiv , year=

    X-Risk Analysis for AI Research , author=. arXiv , year=

  29. [35]

    In-context Learning and Induction Heads

    In-context learning and induction heads , author=. arXiv preprint arXiv:2209.11895 , year=

  30. [36]

    Information , volume=

    A literature survey of recent advances in chatbots , author=. Information , volume=. 2022 , publisher=

  31. [37]

    Nature Biomedical Engineering , pages=

    Shifting machine learning for healthcare from development to deployment and from models to data , author=. Nature Biomedical Engineering , pages=. 2022 , publisher=

  32. [38]

    ArXiv , year=

    Emergent Abilities of Large Language Models , author=. ArXiv , year=

  33. [39]

    2022 , month = jun, journal =

    Unsolved problems in ml safety , author=. arXiv preprint arXiv:2109.13916 , year=

  34. [40]

    International Conference on Learning Representations , year=

    Natural Language Descriptions of Deep Visual Features , author=. International Conference on Learning Representations , year=

  35. [41]

    Advances in Neural Information Processing Systems , volume=

    Compositional explanations of neurons , author=. Advances in Neural Information Processing Systems , volume=

  36. [42]

    arXiv preprint arXiv:2106.00737 , year=

    Implicit representations of meaning in neural language models , author=. arXiv preprint arXiv:2106.00737 , year=

  37. [43]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Toward a visual concept vocabulary for gan latent space , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  38. [44]

    Advances in Neural Information Processing Systems , volume=

    Investigating gender bias in language models using causal mediation analysis , author=. Advances in Neural Information Processing Systems , volume=

  39. [45]

    arXiv preprint arXiv:2110.07483 , year=

    On the pitfalls of analyzing individual neurons in language models , author=. arXiv preprint arXiv:2110.07483 , year=

  40. [46]

    arXiv preprint arXiv:2106.06087 , year=

    Causal analysis of syntactic agreement mechanisms in neural language models , author=. arXiv preprint arXiv:2106.06087 , year=

  41. [47]

    ArXiv , year=

    On the Opportunities and Risks of Foundation Models , author=. ArXiv , year=

  42. [48]

    Are sixteen heads really better than one? Advances in Neural Information Processing Systems.

  43. [49]

    Räuker, Tilman and Ho, Anson and Casper, Stephen and Hadfield-Menell, Dylan. Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. 2022. doi:10.48550/ARXIV.2207.13243

  44. [50]

    Analyzing Transformers in Embedding Space. arXiv preprint arXiv:2209.02535.

  45. [51]

    Transformer Feed-Forward Layers Are Key-Value Memories. arXiv preprint arXiv:2012.14913.

  46. [52]

    Locating and Editing Factual Associations in GPT. arXiv preprint arXiv:2202.05262, January 2023.

  47. [53]

    Sturmfels, Pascal and Lundberg, Scott and Lee, Su-In. Distill.

  48. [54]

    Bengio, Yoshua and LeCun, Yann. Scaling Learning Algorithms Towards

  49. [55]

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye. A Fast Learning Algorithm for Deep Belief Nets.

  50. [56]

    Deep learning. 2016.

  51. [57]

    Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases.

  52. [58]

    A Mathematical Framework for Transformer Circuits. 2021.

  53. [59]

    A transparency and interpretability tech tree.

  54. [60]

    Causal Abstractions of Neural Networks. Advances in Neural Information Processing Systems, 2021.

  55. [61]

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser,. Advances in Neural Information Processing Systems.

  56. [62]

    Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winte... Language Models are Few-Shot Learners.

  57. [63]

    Olah, Chris and Cammarata, Nick and Schubert, Ludwig and Goh, Gabriel and Petrov, Michael and Carter, Shan. Distill.

  58. [64]

    Language Models are Unsupervised Multitask Learners.

  59. [65]

    Adebayo, Julius and Gilmer, Justin and Muelly, Michael and Goodfellow, Ian and Hardt, Moritz and Kim, Been. Sanity Checks for Saliency Maps.

  60. [66]

    Nanda, Neel. GitHub repository, 2022.

  61. [67]

    Nanda, Neel and Lieberum, Tom.

  62. [68]

    nostalgebraist.

  63. [69]

    Hubinger, Evan. An overview of 11 proposals for building safe advanced AI. December 2020. doi:10.48550/ARXIV.2012.07532

  64. [70]

    Jain, Sarthak and Wallace, Byron C. Attention is not Explanation. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v1/N19-1357

  65. [71]

    LeCun, Yann and Denker, John and Solla, Sara. Optimal Brain Damage.

  66. [72]

    Softmax Linear Units. 2022.