GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
arXiv preprint arXiv:2106.00737 , year=
4 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
RL simulations find explicit action guidance in instructions and a specific ordering curriculum improve number composition learning and generalization.
citing papers explorer
-
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
-
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Exploring Natural Language-Based Strategies for Efficient Number Learning in Children through Reinforcement Learning
RL simulations find explicit action guidance in instructions and a specific ordering curriculum improve number composition learning and generalization.