Towards monosemanticity: Decomposing language models with dictionary learning
3 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift
  GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.
- Sparse Feature Coactivation Reveals Causal Semantic Modules in Large Language Models
  Coactivation of sparse autoencoder features reveals causal semantic modules for concepts and relations in LLMs that can be ablated or amplified to produce predictable and counterfactual changes in outputs.
- Beyond Interpretability: When, Why, and How Sparse Autoencoders Enable Label-Free Visual Steering