Title resolution pending

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning , author= · 2023

18 Pith papers cite this work. Polarity classification is still indexing.

18 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

In a combinatorial toy setting, winning lottery tickets preserve families of compatible feature locations in early feature space that balance proximity to final codes with low interference, rather than specific weight subnetworks.

Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.

SMIXAE: Towards Unsupervised Manifold Discovery in Language Models

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

SMIXAE is a new mixture-of-autoencoders architecture that learns multidimensional manifolds directly from transformer activations, recovering known structures and identifying novel ones in Gemma 2 2B and 9B models.

From Mechanistic to Compositional Interpretability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

Improving Dictionary Learning with Gated Sparse Autoencoders

cs.LG · 2024-04-24 · unverdicted · novelty 7.0

Gated SAEs decouple which features to use from how large their activations should be, applying the L1 penalty only to selection and thereby eliminating shrinkage while halving the number of firing features needed for good fidelity.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.

Bilinear autoencoders find interpretable manifolds

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Bilinear autoencoders decompose neural activations into low-rank quadratic forms to discover interpretable multi-dimensional manifolds, improving reconstruction in language models and challenging linear representation assumptions.

The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.

Understanding Annotator Safety Policy with Interpretability

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.

Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives

cs.CL · 2026-04-22 · unverdicted · novelty 6.0

A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.

Understanding the Mechanism of Altruism in Large Language Models

econ.GN · 2026-04-21 · unverdicted · novelty 6.0

A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

Linear Representations of Sentiment in Large Language Models

cs.LG · 2023-10-23 · unverdicted · novelty 6.0

Sentiment is represented as a single linear direction in LLM activation space that is causally relevant across tasks and is summarized at punctuation and names in addition to charged words.

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 4.0

Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.

There Will Be a Scientific Theory of Deep Learning

stat.ML · 2026-04-23 · unverdicted · novelty 2.0

A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.

How Language Models Process Negation

cs.CL · 2026-05-04

citing papers explorer

Showing 18 of 18 citing papers.

Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space cs.LG · 2026-05-18 · unverdicted · none · ref 43
In a combinatorial toy setting, winning lottery tickets preserve families of compatible feature locations in early feature space that balance proximity to final codes with low interference, rather than specific weight subnetworks.
Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space cs.LG · 2026-05-15 · unverdicted · none · ref 45
Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.
SMIXAE: Towards Unsupervised Manifold Discovery in Language Models cs.LG · 2026-05-09 · unverdicted · none · ref 10
SMIXAE is a new mixture-of-autoencoders architecture that learns multidimensional manifolds directly from transformer activations, recovering known structures and identifying novel ones in Gemma 2 2B and 9B models.
From Mechanistic to Compositional Interpretability cs.LG · 2026-05-09 · unverdicted · none · ref 128
Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.
Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models cs.AI · 2026-04-30 · unverdicted · none · ref 45
LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 18
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Improving Dictionary Learning with Gated Sparse Autoencoders cs.LG · 2024-04-24 · unverdicted · none · ref 209
Gated SAEs decouple which features to use from how large their activations should be, applying the L1 penalty only to selection and thereby eliminating shrinkage while halving the number of firing features needed for good fidelity.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 30
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space cs.CL · 2026-05-12 · unverdicted · none · ref 24
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
Bilinear autoencoders find interpretable manifolds cs.LG · 2026-05-09 · unverdicted · none · ref 12
Bilinear autoencoders decompose neural activations into low-rank quadratic forms to discover interpretable multi-dimensional manifolds, improving reconstruction in language models and challenging linear representation assumptions.
The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans cs.CL · 2026-05-09 · unverdicted · none · ref 38
LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.
Understanding Annotator Safety Policy with Interpretability cs.AI · 2026-05-06 · unverdicted · none · ref 51
Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives cs.CL · 2026-04-22 · unverdicted · none · ref 73
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
Understanding the Mechanism of Altruism in Large Language Models econ.GN · 2026-04-21 · unverdicted · none · ref 232
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
Linear Representations of Sentiment in Large Language Models cs.LG · 2023-10-23 · unverdicted · none · ref 64
Sentiment is represented as a single linear direction in LLM activation space that is causally relevant across tasks and is summarized at punctuation and names in addition to charged words.
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 74
Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
There Will Be a Scientific Theory of Deep Learning stat.ML · 2026-04-23 · unverdicted · none · ref 211
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.
How Language Models Process Negation cs.CL · 2026-05-04 · unreviewed · ref 29

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer