SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.
Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.LG 3years
2026 3verdicts
UNVERDICTED 3roles
background 2polarities
background 2representative citing papers
In a controlled synthetic setting, transformers implement in-distribution task inference via convex combinations of task vectors and out-of-distribution inference via nearly orthogonal extrapolative representations.
RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
citing papers explorer
-
Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning
SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.
-
Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers
In a controlled synthetic setting, transformers implement in-distribution task inference via convex combinations of task vectors and out-of-distribution inference via nearly orthogonal extrapolative representations.
-
Towards Effective Theory of LLMs: A Representation Learning Approach
RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.