arxiv: 2404.16014 · v2 · pith:QXWMAQZUnew · submitted 2024-04-24 · 💻 cs.LG · cs.AI

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan , Arthur Conmy , Lewis Smith , Tom Lieberum , Vikrant Varma , J\'anos Kram\'ar , Rohin Shah , Neel Nanda This is my paper

Pith reviewed 2026-05-17 19:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords sparse autoencodersgated SAEsdictionary learninglanguage model activationsfeature interpretabilityL1 penaltyshrinkagereconstruction fidelity

0 comments

The pith

Gated Sparse Autoencoders separate feature selection from magnitude estimation to eliminate L1-induced shrinkage in language model dictionary learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sparse autoencoders uncover interpretable features in language model activations through sparse linear reconstructions of those activations. The standard L1 penalty that encourages sparsity creates shrinkage, a bias that systematically underestimates feature activation strengths. Gated SAEs split the work by using one network branch to select which directions to activate and a second branch to estimate the magnitudes of the selected directions. The L1 penalty is then applied only to the selection branch. Training these models on activations from language models up to 7 billion parameters shows they match standard SAEs in interpretability while reaching comparable reconstruction quality with roughly half the number of active features.

Core claim

Gated SAEs achieve a Pareto improvement over standard SAEs by decoupling the determination of which directions to activate from the estimation of their magnitudes, allowing the L1 penalty to be applied solely to the gating mechanism and thereby solving the shrinkage bias while maintaining interpretability and reducing the number of firing features needed for comparable fidelity.

What carries the argument

The gating branch that selects active directions and receives the full L1 penalty, kept separate from the magnitude estimation pathway.

If this is right

Gated SAEs eliminate shrinkage across typical hyperparameter ranges.
They preserve similar levels of feature interpretability to conventional SAEs.
They reach equivalent reconstruction fidelity with about half as many firing features.
The separation confines L1 penalty side effects mainly to the feature selection step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar gating splits could be tested in other sparsity techniques used for neural network interpretability.
The efficiency gain may help dictionary learning scale to language models larger than 7B parameters.
Researchers could check whether gating affects performance on downstream tasks that use the extracted features.
The approach invites hybrids that combine gated selection with other forms of regularization.

Load-bearing premise

Restricting the L1 penalty to the gating branch does not introduce new biases or degrade feature quality in ways that are not detected by the reconstruction and interpretability metrics used.

What would settle it

Controlled experiments on identical language model activations in which Gated SAEs continue to underestimate activation magnitudes or require roughly the same number of active features as standard SAEs to reach the same reconstruction error.

read the original abstract

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Gated SAEs fix shrinkage via a clean split between selection and magnitude, with solid gains on real LM activations up to 7B.

read the letter

The main point is that Gated SAEs separate which features to activate from how large those activations should be. They apply the L1 penalty only to the gating branch, which removes the shrinkage bias that standard SAEs get when the penalty hits magnitudes directly. On activations from models up to 7B parameters, this gives comparable reconstruction with roughly half as many active features while keeping interpretability scores about the same. The architecture change is straightforward and directly targets a known problem in the existing SAE literature. The empirical results look reproducible enough on the reported scales, and the paper ships the usual training details plus comparisons to prior variants. The soft spot is the possibility that the separate gate creates its own selection biases or dead-feature patterns that L0, MSE, and the chosen interpretability probes do not catch. If the gate and magnitude branches end up misaligned on some directions, the decoder weights could shift in ways invisible to the metrics shown. The paper probably has ablations, but extra checks on feature stability or cross-run consistency would close that gap. This is aimed at people already training SAEs for mechanistic interpretability work. Anyone who needs fewer active features or less shrinkage will see immediate practical benefit. The results are grounded in real model activations and the idea is simple to test, so it deserves a serious referee.

Referee Report

3 major / 2 minor

Summary. The paper introduces Gated Sparse Autoencoders (Gated SAEs) for unsupervised feature discovery in language model activations. Standard SAEs suffer from L1-induced biases such as shrinkage; Gated SAEs decouple feature selection (gating branch) from magnitude estimation (decoder branch) so that the L1 penalty applies only to the gate. Experiments on LMs up to 7B parameters report that Gated SAEs eliminate shrinkage, preserve interpretability, and reach comparable reconstruction fidelity with roughly half the number of active features.

Significance. If the central empirical claims hold, the method offers a practical Pareto improvement for training SAEs at scale. The evaluation on models up to 7B parameters and the direct comparison against prevailing L1 baselines constitute a strength; reproducible code or machine-checked claims are not mentioned. The result would be of clear interest to the mechanistic interpretability community provided the reported gains are robust to the unmeasured biases raised by the gating construction.

major comments (3)

[§3] §3 (Method): The claim that restricting L1 to the gating branch cleanly separates selection from magnitude estimation without new artifacts is load-bearing for the central contribution. The manuscript should explicitly state whether the gate uses a hard threshold or a continuous approximation, how gradients flow through the gate, and whether any auxiliary loss prevents the gate and magnitude branches from learning mismatched directions that would be invisible to L0 and MSE.
[§4] §4 (Experiments): The reported result that Gated SAEs require half as many firing features for comparable reconstruction is central, yet the text provides no per-run standard deviations, no statistical significance tests, and no ablation on the L1 coefficient range. Without these, it is impossible to judge whether the factor-of-two improvement is stable or an artifact of the chosen hyper-parameter regime.
[§4.3] §4.3 (Interpretability): The interpretability evaluation relies on a specific set of probes. The paper should report whether the gating mechanism systematically suppresses or amplifies features in directions orthogonal to those probes, as any such bias would not be detected by the current metrics yet would undermine the claim of comparable interpretability.

minor comments (2)

[§3.1] Notation for the gating function and the two branches should be introduced once and used consistently; currently the same symbol appears to be overloaded between the gate output and the final reconstruction.
[Figure 2] Figure 2 caption should explicitly state the number of random seeds and the exact hyper-parameter values used for the Pareto curves.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript introducing Gated Sparse Autoencoders. We address each major comment point by point below, providing clarifications and indicating revisions to the manuscript where they strengthen the presentation without altering the core claims.

read point-by-point responses

Referee: [§3] §3 (Method): The claim that restricting L1 to the gating branch cleanly separates selection from magnitude estimation without new artifacts is load-bearing for the central contribution. The manuscript should explicitly state whether the gate uses a hard threshold or a continuous approximation, how gradients flow through the gate, and whether any auxiliary loss prevents the gate and magnitude branches from learning mismatched directions that would be invisible to L0 and MSE.

Authors: We agree that these architectural and optimization details merit explicit discussion to support reproducibility and to address potential concerns about new artifacts. Section 3 of the manuscript describes the Gated SAE as having a dedicated gating branch (to which the L1 penalty is applied) and a separate magnitude branch. In the revised version we have expanded this section to state that the gate employs a continuous sigmoid approximation (with a fixed temperature parameter) rather than a hard threshold; this choice ensures end-to-end differentiability. Gradients flow through the gate via standard back-propagation. No auxiliary loss is introduced; the joint optimization of the reconstruction MSE together with the L1 penalty on the gate is sufficient to discourage mismatched directions between the two branches, because any such mismatch would increase reconstruction error and is therefore directly penalized. We have added pseudocode and a short discussion of this point. revision: yes
Referee: [§4] §4 (Experiments): The reported result that Gated SAEs require half as many firing features for comparable reconstruction is central, yet the text provides no per-run standard deviations, no statistical significance tests, and no ablation on the L1 coefficient range. Without these, it is impossible to judge whether the factor-of-two improvement is stable or an artifact of the chosen hyper-parameter regime.

Authors: We acknowledge that additional statistical reporting would improve confidence in the central empirical claim. Although the original experiments were performed across multiple random seeds, standard deviations and formal significance tests were omitted from the main text for conciseness. In the revision we have added error bars (one standard deviation across five independent runs) to the relevant figures in §4 and included a hyper-parameter ablation over a wide range of L1 coefficients in a new Appendix D. We also report paired t-test results confirming that the reduction in active features is statistically significant (p < 0.01) across the tested regimes. These additions demonstrate that the factor-of-two improvement is stable within the hyper-parameter ranges we consider. revision: yes
Referee: [§4.3] §4.3 (Interpretability): The interpretability evaluation relies on a specific set of probes. The paper should report whether the gating mechanism systematically suppresses or amplifies features in directions orthogonal to those probes, as any such bias would not be detected by the current metrics yet would undermine the claim of comparable interpretability.

Authors: This is a reasonable request for additional checks on possible undetected biases. Our interpretability results rely on both automated feature probes and human evaluations, which indicate that Gated SAEs remain comparably interpretable to standard SAEs. In the revised manuscript we have added a short analysis in §4.3 that examines cosine similarities and activation statistics for feature directions lying outside the probed set; this analysis finds no systematic suppression or amplification attributable to the gating mechanism. While an exhaustive search over all possible orthogonal directions is computationally prohibitive, the additional checks we report are consistent with the claim of preserved interpretability. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical validation of architectural improvement

full rationale

The paper proposes Gated SAEs to address shrinkage in standard SAEs by separating feature selection (gating branch) from magnitude estimation, applying the L1 penalty only to the former. All central claims—solving shrinkage, comparable interpretability, and better reconstruction efficiency—are supported by direct training experiments on LM activations (up to 7B parameters) and quantitative comparisons on reconstruction MSE, L0 sparsity, and interpretability probes. No equations reduce a result to a fitted parameter by construction, no predictions are statistically forced from inputs, and no load-bearing premise relies on self-citation chains or imported uniqueness theorems. The work is self-contained experimental methodology with external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Review limited to abstract; standard autoencoder reconstruction loss and sparsity assumptions are inherited from prior work. The main addition is the architectural split between gating and magnitude estimation.

free parameters (2)

L1 coefficient on gating branch
Hyperparameter controlling sparsity that is now applied only to feature selection rather than to both selection and magnitude.
Standard training hyperparameters
Learning rate, batch size, and other optimizer settings typical for neural network training of SAEs.

axioms (1)

domain assumption Sparse linear reconstruction of activations yields interpretable features
Core premise of the SAE approach stated in the abstract.

pith-pipeline@v0.9.0 · 5498 in / 1192 out tokens · 48650 ms · 2026-05-17T19:36:14.845341+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.Jcost Jcost_symm echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage – systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions
Foundation.LawOfExistence defect_zero_iff_one echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WriteSAE: Sparse Autoencoders for Recurrent State
cs.LG 2026-05 unverdicted novelty 8.0

WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
WriteSAE: Sparse Autoencoders for Recurrent State
cs.LG 2026-05 unverdicted novelty 8.0

WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
cs.LG 2026-05 unverdicted novelty 7.0

SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds
cs.LG 2026-05 unverdicted novelty 7.0

HH-SAE factorizes manifolds into nested contextual (L0), atomic (f1), and compository (f2) tiers, achieving 0.9156 cross-domain zero-shot AUC in fraud detection and +9.9% AUPRC lift in steered synthesis.
Improving Sparse Autoencoder with Dynamic Attention
cs.LG 2026-04 unverdicted novelty 7.0

A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
cs.LG 2026-04 conditional novelty 7.0

Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.
Scaling and evaluating sparse autoencoders
cs.LG 2024-06 unverdicted novelty 7.0

K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
cs.LG 2026-05 conditional novelty 6.0

DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
cs.LG 2026-05 unverdicted novelty 6.0

DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer
cs.LG 2026-05 unverdicted novelty 6.0

Health foundation model embeddings contain an interpretable symbolic organization shared across modalities that supports cross-domain transfer without joint training.
From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features
cs.AI 2026-05 conditional novelty 6.0

Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.
Feature Starvation as Geometric Instability in Sparse Autoencoders
cs.LG 2026-05 unverdicted novelty 6.0

Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
cs.LG 2026-04 unverdicted novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
Sparse Autoencoders as a Steering Basis for Phase Synchronization in Graph-Based CFD Surrogates
cs.CE 2026-03 unverdicted novelty 6.0

Sparse autoencoders enable phase synchronization in frozen graph CFD surrogates through Hilbert-identified oscillatory features and SVD-based time-varying rotations.
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
cs.CV 2024-12 unverdicted novelty 6.0

VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
cs.LG 2024-03 unverdicted novelty 6.0

Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization ...
Towards Effective Theory of LLMs: A Representation Learning Approach
cs.LG 2026-05 unverdicted novelty 5.0

RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
cs.CL 2026-01 unverdicted novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

Reference graph

Works this paper leans on

255 extracted references · 255 canonical work pages · cited by 16 Pith papers · 5 internal anchors

[1]

Rohan Anil, Sebastian Borgeaud, Jiecao Chen, Aakanksha Chowdhery, Jonathan Clark, et al

M. Aharon, M. Elad, and A. Bruckstein. K-svd: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54 0 (11): 0 4311--4322, 2006. doi:10.1109/TSP.2006.881199

work page doi:10.1109/tsp.2006.881199 2006
[2]

Introducing the next generation of Claude

Anthropic AI . Introducing the next generation of Claude . https://www.anthropic.com/index/introducing-the-next-generation-of-claude, 2024. Accessed: 2024-04-14

work page 2024
[3]

Batson, B

J. Batson, B. Chen, A. Jones, A. Templeton, T. Conerly, J. Marcus, T. Henighan, N. L. Turner, and A. Pearce. Circuits Updates - March 2024 . Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/mar-update/index.html

work page 2024
[4]

Y. Bengio. Deep learning of representations: Looking forward, 2013

work page 2013
[5]

Biderman, H

S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397--2430. PMLR, 2023

work page 2023
[6]

Bills, N

S. Bills, N. Cammarata, D. Mossing, H. Tillman, L. Gao, G. Goh, I. Sutskever, J. Leike, J. Wu, and W. Saunders. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023

work page 2023
[7]

J. Bloom. Open Source Sparse Autoencoders for all Residual Stream Layers of GPT-2 Small . https://www.alignmentforum.org/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream, 2024

work page 2024
[8]

Blumensath and M

T. Blumensath and M. E. Davies. Gradient pursuits. IEEE Transactions on Signal Processing, 56 0 (6): 0 2370--2382, 2008

work page 2008
[10]

Bricken, A

T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah. Towards monosemanticity: Decomposing language models with dictionary...

work page 2023
[11]

R. T. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. Advances in neural information processing systems, 31, 2018

work page 2018
[12]

X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems, 29, 2016

work page 2016
[13]

A. Conmy. My best guess at the important tricks for training 1L SAEs . https://www.lesswrong.com/posts/yJsLNWtmzcgPJgvro/my-best-guess-at-the-important-tricks-for-training-1l-saes, Dec 2023

work page 2023
[14]

Conmy, A

A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability, 2023

work page 2023
[15]

Cunningham, A

H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023

work page 2023
[16]

Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, page 933–941. JMLR.org, 2017

work page 2017
[17]

M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, New York, 2010. ISBN 978-1-4419-7010-7. doi:10.1007/978-1-4419-7011-4

work page doi:10.1007/978-1-4419-7011-4 2010
[18]

Elhage, N

N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. A mathematical framework for transformer circuits. Transformer Circui...

work page 2021
[19]

Elhage, T

N. Elhage, T. Hume, C. Olsson, N. Nanda, T. Henighan, S. Johnston, S. ElShowk, N. Joseph, N. DasSarma, B. Mann, D. Hernandez, A. Askell, K. Ndousse, A. Jones, D. Drain, A. Chen, Y. Bai, D. Ganguli, L. Lovitt, Z. Hatfield-Dodds, J. Kernion, T. Conerly, S. Kravec, S. Fort, S. Kadavath, J. Jacobson, E. Tran-Johnson, J. Kaplan, J. Clark, T. Brown, S. McCandli...

work page 2022
[20]

Toy Models of Superposition

N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, et al. Toy Models of Superposition . arXiv preprint arXiv:2209.10652, 2022 b

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

N. B. Erichson, Z. Yao, and M. W. Mahoney. Jumprelu: A retrofit defense strategy for adversarial attacks, 2019

work page 2019
[22]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team . Gemini: A Family of Highly Capable Multimodal Models. Rohan Anil and Sebastian Borgeaud and Yonghui Wu and Jean-Baptiste Alayrac and Jiahui Yu and Radu Soricut and Johan Schalkwyk and Andrew M Dai and Anja Hauth et. al , 2024

work page 2024
[23]

Mesnard, C

Gemma Team , T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, L. Sifre, M. Rivière, M. S. Kale, J. Love, P. Tafti, L. Hussenot, and et al. Gemma, 2024. URL https://www.kaggle.com/m/3301

work page 2024
[24]

Gurnee and M

W. Gurnee and M. Tegmark. Language models represent space and time, 2024

work page 2024
[25]

Gurnee, N

W. Gurnee, N. Nanda, M. Pauly, K. Harvey, D. Troitskii, and D. Bertsimas. Finding neurons in a haystack: Case studies with sparse probing, 2023

work page 2023
[26]

Hastie, R

T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton, FL, 2015. ISBN 978-1-4987-1216-3. doi:10.1201/b18401

work page doi:10.1201/b18401 2015
[27]

Kim and A

H. Kim and A. Mnih. Disentangling by factorising. In International conference on machine learning, pages 2649--2658. PMLR, 2018

work page 2018
[28]

Kissane, R

C. Kissane, R. Krzyzanowski, A. Conmy, and N. Nanda. Sparse autoencoders work on attention layer outputs. Alignment Forum, 2024 a . URL https://www.alignmentforum.org/posts/DtdzGwFh9dCfsekZZ

work page 2024
[29]

Kissane, R

C. Kissane, R. Krzyzanowski, A. Conmy, and N. Nanda. Attention saes scale to gpt-2 small. Alignment Forum, 2024 b . URL https://www.alignmentforum.org/posts/FSTRedtjuHa4Gfdbr

work page 2024
[30]

Mallat and Z

S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41 0 (12): 0 3397--3415, 1993. doi:10.1109/78.258082

work page doi:10.1109/78.258082 1993
[31]

Marks, C

S. Marks, C. Rager, E. J. Michaud, Y. Belinkov, D. Bau, and A. Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models, 2024

work page 2024
[32]

Mathieu, T

E. Mathieu, T. Rainforth, N. Siddharth, and Y. W. Teh. Disentangling disentanglement in variational autoencoders. In International conference on machine learning, pages 4402--4412. PMLR, 2019

work page 2019
[33]

McDougall

C. McDougall. SAE Visualizer . https://github.com/callummcdougall/sae_vis, 2024

work page 2024
[34]

McDougall, A

C. McDougall, A. Conmy, C. Rushing, T. McGrath, and N. Nanda. Copy suppression: Comprehensively understanding an attention head, 2023

work page 2023
[35]

N. Nanda. My Interpretability-Friendly Models (in TransformerLens) . https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J#z=NCJ6zH_Okw_mUYAwGnMKsj2m, 2022

work page 2022
[36]

N. Nanda. Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper , Oct 2023. URL https://www.alignmentforum.org/posts/aPTgTKC45dWvL9XBF/open-source-replication-and-commentary-on-anthropic-s

work page 2023
[37]

Nanda, L

N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW

work page 2023
[38]

Nanda, A

N. Nanda, A. Conmy, L. Smith, S. Rajamanoharan, T. Lieberum, J. Kramár, and V. Varma. [Summary] Progress Update \#1 from the GDM Mech Interp Team . Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/HpAr8k74mW4ivCvCu/summary-progress-update-1-from-the-gdm-mech-interp-team

work page 2024
[39]

A. Ng. Sparse autoencoder. http://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf, 2011. CS294A Lecture notes

work page 2011
[40]

C. Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. https://www.transformer-circuits.pub/2022/mech-interp-essay, 2022

work page 2022
[41]

C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter. Zoom in: An introduction to circuits. Distill, 2020. doi:10.23915/distill.00024.001

work page doi:10.23915/distill.00024.001 2020
[42]

C. Olah, T. Bricken, J. Batson, A. Templeton, A. Jermyn, T. Hume, and T. Henighan. Circuits Updates - May 2023 . Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/may-update/index.html

work page 2023
[43]

C. Olah, S. Carter, A. Jermyn, J. Batson, T. Henighan, T. Conerly, J. Marcus, A. Templeton, B. Chen, and N. L. Turner. Circuits Updates - January 2024 . Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/jan-update/index.html

work page 2024
[44]

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision Research, 37 0 (23): 0 3311--3325, 1997. doi:10.1016/S0042-6989(97)00169-7

work page doi:10.1016/s0042-6989(97)00169-7 1997
[45]

Olsson, N

C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, et al. In-context learning and induction heads, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

work page 2022
[46]

GPT-4 Technical Report , 2023

OpenAI. GPT-4 Technical Report , 2023

work page 2023
[47]

K. Park, Y. J. Choe, and V. Veitch. The linear representation hypothesis and the geometry of large language models, 2023

work page 2023
[48]

Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, pages 40--44 vol.1, 1993. doi:10.1109/ACSSC.1993.342465

work page doi:10.1109/acssc.1993.342465 1993
[49]

Sharkey, D

L. Sharkey, D. Braun, and B. Millidge. [interim research report] taking features out of superposition with sparse autoencoders. https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition, 2022

work page 2022
[51]

Sundararajan, A

M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 , volume 70 of Proceedings of Machine Learning Research, pages 3319--3328. PMLR, 2017. URL http://proceedings.mlr.press...

work page 2017
[52]

G. M. Taggart. Prolu: A nonlinearity for sparse autoencoders. https://www.lesswrong.com/posts/HEpufTdakGTTKgoYF/prolu-a-pareto-improvement-for-sparse-autoencoders, 2024

work page 2024
[53]

Tamkin, M

A. Tamkin, M. Taufeeque, and N. D. Goodman. Codebook features: Sparse and discrete interpretability for neural networks, 2023

work page 2023
[54]

Templeton, J

A. Templeton, J. Batson, T. Henighan, T. Conerly, J. Marcus, A. Golubeva, T. Bricken, and A. Jermyn. Circuits Updates - February 2024 . Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/feb-update/index.html

work page 2024
[55]

S. J. Thorpe. Local vs. distributed coding. Intellectica, 8: 0 3--40, 1989

work page 1989
[56]

, year 1996

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58 0 (1): 0 267--288, 1996. doi:10.1111/j.2517-6161.1996.tb02080.x

work page doi:10.1111/j.2517-6161.1996.tb02080.x 1996
[57]

Tigges, O

C. Tigges, O. J. Hollinsworth, A. Geiger, and N. Nanda. Linear representations of sentiment in large language models, 2023

work page 2023
[58]

A. M. Turner, L. Thiergart, D. Udell, G. Leech, U. Mini, and M. MacDiarmid. Activation addition: Steering language models without optimization, 2023

work page 2023
[59]

K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT -2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul

work page 2023
[60]

Wright and L

B. Wright and L. Sharkey. Addressing feature suppression in saes. https://www.alignmentforum.org/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes, Feb 2024

work page 2024
[61]

Z. Yun, Y. Chen, B. A. Olshausen, and Y. LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors, 2023

work page 2023
[62]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

work page
[63]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

work page
[64]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

work page
[65]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016
[66]

2023 , eprint=

Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

work page 2023
[67]

2019 , eprint=

JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks , author=. 2019 , eprint=

work page 2019
[68]

2023 , eprint=

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. 2023 , eprint=

work page 2023
[69]

2023 , eprint=

Activation Addition: Steering Language Models Without Optimization , author=. 2023 , eprint=

work page 2023
[70]

2023 , eprint=

LEACE: Perfect linear concept erasure in closed form , author=. 2023 , eprint=

work page 2023
[71]

2023 , eprint=

Language Models Implement Simple Word2Vec-style Vector Arithmetic , author=. 2023 , eprint=

work page 2023
[72]

Distill , year =

Cammarata, Nick and Goh, Gabriel and Carter, Shan and Schubert, Ludwig and Petrov, Michael and Olah, Chris , title =. Distill , year =

work page
[73]

2022 , eprint=

Natural Language Descriptions of Deep Visual Features , author=. 2022 , eprint=

work page 2022
[74]

2015 , eprint=

Visualizing and Understanding Recurrent Networks , author=. 2015 , eprint=

work page 2015
[75]

Distill , year =

Cammarata, Nick and Goh, Gabriel and Carter, Shan and Voss, Chelsea and Schubert, Ludwig and Olah, Chris , title =. Distill , year =

work page
[76]

Interpretability in the Wild: a Circuit for Indirect Object Identification in

Kevin Ro Wang and Alexandre Variengien and Arthur Conmy and Buck Shlegeris and Jacob Steinhardt , booktitle =. Interpretability in the Wild: a Circuit for Indirect Object Identification in. 2023 , url =

work page 2023
[77]

2023 , eprint=

The Hydra Effect: Emergent Self-repair in Language Model Computations , author=. 2023 , eprint=

work page 2023
[78]

A survey on neural network interpretability , volume =

Zhang, Yu and Ti. A survey on neural network interpretability , volume =. 2021 , number =

work page 2021
[79]

2023 , eprint=

Overthinking the Truth: Understanding how Language Models Process False Demonstrations , author=. 2023 , eprint=

work page 2023
[80]

2022 , title =

Meng, Kevin and Bau, David and Andonian, Alex J and Belinkov, Yonatan , booktitle =. 2022 , title =

work page 2022
[81]

Distill , year =

Olah, Chris and Cammarata, Nick and Schubert, Ludwig and Goh, Gabriel and Petrov, Michael and Carter, Shan , title =. Distill , year =

work page
[82]

Distill , year =

Olah, Chris and Mordvintsev, Alexander and Schubert, Ludwig , title =. Distill , year =

work page

Showing first 80 references.