Path patching provides a method to express and quantitatively test hypotheses that neural network behaviors are localized to sets of paths.
Zoom in: An introduction to circuits
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2years
2023 2verdicts
UNVERDICTED 2representative citing papers
Sparse autoencoders applied to language model activations yield more interpretable and monosemantic features than alternative approaches, enabling finer causal analysis on the indirect object identification task.
citing papers explorer
-
Localizing Model Behavior with Path Patching
Path patching provides a method to express and quantitatively test hypotheses that neural network behaviors are localized to sets of paths.
-
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Sparse autoencoders applied to language model activations yield more interpretable and monosemantic features than alternative approaches, enabling finer causal analysis on the indirect object identification task.