Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatf · 2023

3 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.

Sparse Feature Coactivation Reveals Causal Semantic Modules in Large Language Models

cs.CL · 2025-06-22 · unverdicted · novelty 6.0

Coactivation of sparse autoencoder features reveals causal semantic modules for concepts and relations in LLMs that can be ablated or amplified to produce predictable and counterfactual changes in outputs.

Beyond Interpretability: When, Why, and How Sparse Autoencoders Enable Label-Free Visual Steering

cs.CV · 2025-06-02

citing papers explorer

Showing 3 of 3 citing papers.

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

cs.LG · 2026-05-21 · unverdicted · none · ref 4

GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.

Sparse Feature Coactivation Reveals Causal Semantic Modules in Large Language Models

cs.CL · 2025-06-22 · unverdicted · none · ref 5

Coactivation of sparse autoencoder features reveals causal semantic modules for concepts and relations in LLMs that can be ablated or amplified to produce predictable and counterfactual changes in outputs.

Beyond Interpretability: When, Why, and How Sparse Autoencoders Enable Label-Free Visual Steering

cs.CV · 2025-06-02 · unreviewed · ref 28
