A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
On privileged and convergent bases in neural network representations.arXiv
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3representative citing papers
Stimulus symmetries render many neural representations functionally equivalent yet produce qualitatively different RSMs, including drifting ones from SGD or regularization in image-encoding networks.
Varying evaluation metrics and corruption methods in activation patching produces different localization and circuit discovery outcomes in language models, leading to recommendations for preferred practices.
citing papers explorer
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
Stimulus symmetries can confound representational similarity analyses
Stimulus symmetries render many neural representations functionally equivalent yet produce qualitatively different RSMs, including drifting ones from SGD or regularization in image-encoding networks.
-
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Varying evaluation metrics and corruption methods in activation patching produces different localization and circuit discovery outcomes in language models, leading to recommendations for preferred practices.