Rigorously Assessing Natural Language Explanations of Neurons

Huang, Jing, Atticus Geiger, Karel D’Oosterlinck, Zhengxuan Wu, Christopher Potts · Sept 2023 · arXiv 2309.10312

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

cs.LG · 2026-05-13 · accept · novelty 8.0

Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.

How to use and interpret activation patching

cs.LG · 2024-04-23 · accept · novelty 5.0

Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.

citing papers explorer

Showing 2 of 2 citing papers.

Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

cs.LG · 2026-05-13 · accept · none · ref 13

How to use and interpret activation patching

cs.LG · 2024-04-23 · accept · none · ref 16
