Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.
Sparse autoencoders do not find canonical units of analysis.arXiv preprint arXiv:2502.04878, 2025
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
SA-GSAE with Bi-Jump-ReLU enables one latent to encode both polarities of anticorrelated features, Pareto-dominating or matching full-width gated SAEs while reducing dead latents by up to 500x on some LLM hookpoints.
SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.
Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark performance.
A new framework combines AI-derived concept embeddings with high-dimensional selective inference to enable statistically principled, interpretable discovery from unstructured data in empirical economics.
Mechanistic interpretability papers substitute validation metrics for causal identification without stating assumptions, and the authors propose a disclosure norm requiring explicit identification strategies.
Sparse autoencoders show OOD prompts increase fallacious concept activation in transformers, offering a mechanistic measure of shift and a path to robust fine-tuning.
citing papers explorer
-
Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features
Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.