A five-stage causal feature analysis methodology is proposed and tested on GPT-2 for IOI, showing partial causality of SAE features, robustness differences under shifts, and deployment cost benefits.
Turner, Callum McDougall, Monte MacDiarmid, C
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models
A five-stage causal feature analysis methodology is proposed and tested on GPT-2 for IOI, showing partial causality of SAE features, robustness differences under shifts, and deployment cost benefits.