pith. sign in

super hub Mixed citations

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Mixed citation behavior. Most common role is background (54%).

106 Pith papers citing it
Background 54% of classified citations
abstract

One of the roadblocks to a better understanding of neural networks' internals is \textit{polysemanticity}, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is \textit{superposition}, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task \citep{wang2022interpretability} to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.

hub tools

citation-role summary

background 18 method 6 baseline 1 dataset 1

citation-polarity summary

claims ledger

  • abstract One of the roadblocks to a better understanding of neural networks' internals is \textit{polysemanticity}, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is \textit{superposition}, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identif

authors

co-cited works

clear filters

representative citing papers

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

Slot Machines: How LLMs Keep Track of Multiple Entities

cs.CL · 2026-04-22 · unverdicted · novelty 8.0

LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

KAN: Kolmogorov-Arnold Networks

cs.LG · 2024-04-30 · conditional · novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

Markovian Circuit Tracing for Transformer State Dynamic

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

This paper presents Markovian Circuit Tracing (MCT) as a benchmark and pipeline to extract and test state-transition structures in transformer activations using synthetic HMM tasks, demonstrating that state patching improves counterfactual predictions.

Interpreting Reinforcement Learning Agents with Susceptibilities

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

What Cohort INRs Encode and Where to Freeze Them

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

Transformers with Selective Access to Early Representations

cs.LG · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.

Linear-Readout Floors and Threshold Recovery in Computation in Superposition

cs.LG · 2026-05-02 · unverdicted · novelty 7.0

Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contradiction via distinct readout invariants.

Grokking of Diffusion Models: Case Study on Modular Addition

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

citing papers explorer

Showing 5 of 5 citing papers after filters.

  • WriteSAE: Sparse Autoencoders for Recurrent State cs.LG · 2026-05-12 · unverdicted · none · ref 2 · 2 links · internal anchor

    WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

  • Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 6 · internal anchor

    Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

  • What Cohort INRs Encode and Where to Freeze Them cs.LG · 2026-05-08 · unverdicted · none · ref 10 · internal anchor

    Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

  • Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 25 · 2 links · internal anchor

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  • Towards Understanding the Robustness of Sparse Autoencoders cs.LG · 2026-04-20 · unverdicted · none · ref 5 · internal anchor

    Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.