Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023

2 Pith papers cite this work. Polarity classification is still indexing.

fields

cs.AI 2

years

2026 2

verdicts

UNVERDICTED 2

representative citing papers

Patch-Effect Graph Kernels for LLM Interpretability

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

Patch-effect graphs built from causal mediation, partial correlation, and co-influence, when analyzed with graph kernels, preserve task-discriminative signals from activation patching that outperform global shape descriptors and raw baselines on GPT-2 Small.

citing papers explorer

Showing 2 of 2 citing papers.

  • Patch-Effect Graph Kernels for LLM Interpretability · cs.AI · 2026-05-07 · unverdicted · none · ref 1

    Patch-effect graphs built from causal mediation, partial correlation, and co-influence, when analyzed with graph kernels, preserve task-discriminative signals from activation patching that outperform global shape descriptors and raw baselines on GPT-2 Small.

  • Geometric Routing Enables Causal Expert Control in Mixture of Experts · cs.AI · 2026-04-15 · unverdicted · none · ref 4

    Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.