Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.
Rigorously Assessing Natural Language Explanations of Neurons
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2verdicts
ACCEPT 2representative citing papers
Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.
citing papers explorer
-
Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features
Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.
-
How to use and interpret activation patching
Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.