Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread, 2023

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al · 2023

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

representative citing papers

Patch-Effect Graph Kernels for LLM Interpretability

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

Patch-effect graphs built from causal mediation, partial correlation, and co-influence, when analyzed with graph kernels, preserve task-discriminative signals from activation patching that outperform global shape descriptors and raw baselines on GPT-2 Small.

Geometric Routing Enables Causal Expert Control in Mixture of Experts

cs.AI · 2026-04-15 · unverdicted · novelty 6.0

Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.

citing papers explorer

Showing 2 of 2 citing papers.

Patch-Effect Graph Kernels for LLM Interpretability cs.AI · 2026-05-07 · unverdicted · none · ref 1
Patch-effect graphs built from causal mediation, partial correlation, and co-influence, when analyzed with graph kernels, preserve task-discriminative signals from activation patching that outperform global shape descriptors and raw baselines on GPT-2 Small.
Geometric Routing Enables Causal Expert Control in Mixture of Experts cs.AI · 2026-04-15 · unverdicted · none · ref 4
Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.

Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread, 2023

fields

years

verdicts

representative citing papers

citing papers explorer