GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
arXiv preprint arXiv:2106.00737 (2021)
7 Pith papers cite this work.
representative citing papers
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
citing papers explorer
- Show Your Work: Scratchpads for Intermediate Computation with Language Models
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
- A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners
Supervised fine-tuning lets LLMs linearly encode action validity and state predicates, with broader state-space coverage during training improving world-model recovery.
- Consistency Training while Mitigating Obfuscation via Rate Matching
RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.
- A Systematic Study of Behavioral Cloning for Scientific Data Annotation
Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.
- Exploring Natural Language-Based Strategies for Efficient Number Learning in Children through Reinforcement Learning
RL simulations find explicit action guidance in instructions and a specific ordering curriculum improve number composition learning and generalization.