Varying evaluation metrics and corruption methods in activation patching produces different localization and circuit discovery outcomes in language models, leading to recommendations for preferred practices.
International Conference on Learning Representations (ICLR) , year=
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2023 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Varying evaluation metrics and corruption methods in activation patching produces different localization and circuit discovery outcomes in language models, leading to recommendations for preferred practices.