Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.
Dictionary learning improves patch-free circuit discovery in mechanistic interpretability: A case study on Othello-GPT. arXiv preprint arXiv:2402.12201, 2024b.
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
citing papers explorer
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.