pith. machine review for the scientific record.
sign in

arxiv: 2404.16014 · v2 · pith:QXWMAQZUnew · submitted 2024-04-24 · 💻 cs.LG · cs.AI

Improving Dictionary Learning with Gated Sparse Autoencoders

Pith reviewed 2026-05-17 19:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords sparse autoencodersgated SAEsdictionary learninglanguage model activationsfeature interpretabilityL1 penaltyshrinkagereconstruction fidelity
0
0 comments X

The pith

Gated Sparse Autoencoders separate feature selection from magnitude estimation to eliminate L1-induced shrinkage in language model dictionary learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sparse autoencoders uncover interpretable features in language model activations through sparse linear reconstructions of those activations. The standard L1 penalty that encourages sparsity creates shrinkage, a bias that systematically underestimates feature activation strengths. Gated SAEs split the work by using one network branch to select which directions to activate and a second branch to estimate the magnitudes of the selected directions. The L1 penalty is then applied only to the selection branch. Training these models on activations from language models up to 7 billion parameters shows they match standard SAEs in interpretability while reaching comparable reconstruction quality with roughly half the number of active features.

Core claim

Gated SAEs achieve a Pareto improvement over standard SAEs by decoupling the determination of which directions to activate from the estimation of their magnitudes, allowing the L1 penalty to be applied solely to the gating mechanism and thereby solving the shrinkage bias while maintaining interpretability and reducing the number of firing features needed for comparable fidelity.

What carries the argument

The gating branch that selects active directions and receives the full L1 penalty, kept separate from the magnitude estimation pathway.

If this is right

  • Gated SAEs eliminate shrinkage across typical hyperparameter ranges.
  • They preserve similar levels of feature interpretability to conventional SAEs.
  • They reach equivalent reconstruction fidelity with about half as many firing features.
  • The separation confines L1 penalty side effects mainly to the feature selection step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar gating splits could be tested in other sparsity techniques used for neural network interpretability.
  • The efficiency gain may help dictionary learning scale to language models larger than 7B parameters.
  • Researchers could check whether gating affects performance on downstream tasks that use the extracted features.
  • The approach invites hybrids that combine gated selection with other forms of regularization.

Load-bearing premise

Restricting the L1 penalty to the gating branch does not introduce new biases or degrade feature quality in ways that are not detected by the reconstruction and interpretability metrics used.

What would settle it

Controlled experiments on identical language model activations in which Gated SAEs continue to underestimate activation magnitudes or require roughly the same number of active features as standard SAEs to reach the same reconstruction error.

read the original abstract

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Gated Sparse Autoencoders (Gated SAEs) for unsupervised feature discovery in language model activations. Standard SAEs suffer from L1-induced biases such as shrinkage; Gated SAEs decouple feature selection (gating branch) from magnitude estimation (decoder branch) so that the L1 penalty applies only to the gate. Experiments on LMs up to 7B parameters report that Gated SAEs eliminate shrinkage, preserve interpretability, and reach comparable reconstruction fidelity with roughly half the number of active features.

Significance. If the central empirical claims hold, the method offers a practical Pareto improvement for training SAEs at scale. The evaluation on models up to 7B parameters and the direct comparison against prevailing L1 baselines constitute a strength; reproducible code or machine-checked claims are not mentioned. The result would be of clear interest to the mechanistic interpretability community provided the reported gains are robust to the unmeasured biases raised by the gating construction.

major comments (3)
  1. [§3] §3 (Method): The claim that restricting L1 to the gating branch cleanly separates selection from magnitude estimation without new artifacts is load-bearing for the central contribution. The manuscript should explicitly state whether the gate uses a hard threshold or a continuous approximation, how gradients flow through the gate, and whether any auxiliary loss prevents the gate and magnitude branches from learning mismatched directions that would be invisible to L0 and MSE.
  2. [§4] §4 (Experiments): The reported result that Gated SAEs require half as many firing features for comparable reconstruction is central, yet the text provides no per-run standard deviations, no statistical significance tests, and no ablation on the L1 coefficient range. Without these, it is impossible to judge whether the factor-of-two improvement is stable or an artifact of the chosen hyper-parameter regime.
  3. [§4.3] §4.3 (Interpretability): The interpretability evaluation relies on a specific set of probes. The paper should report whether the gating mechanism systematically suppresses or amplifies features in directions orthogonal to those probes, as any such bias would not be detected by the current metrics yet would undermine the claim of comparable interpretability.
minor comments (2)
  1. [§3.1] Notation for the gating function and the two branches should be introduced once and used consistently; currently the same symbol appears to be overloaded between the gate output and the final reconstruction.
  2. [Figure 2] Figure 2 caption should explicitly state the number of random seeds and the exact hyper-parameter values used for the Pareto curves.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript introducing Gated Sparse Autoencoders. We address each major comment point by point below, providing clarifications and indicating revisions to the manuscript where they strengthen the presentation without altering the core claims.

read point-by-point responses
  1. Referee: [§3] §3 (Method): The claim that restricting L1 to the gating branch cleanly separates selection from magnitude estimation without new artifacts is load-bearing for the central contribution. The manuscript should explicitly state whether the gate uses a hard threshold or a continuous approximation, how gradients flow through the gate, and whether any auxiliary loss prevents the gate and magnitude branches from learning mismatched directions that would be invisible to L0 and MSE.

    Authors: We agree that these architectural and optimization details merit explicit discussion to support reproducibility and to address potential concerns about new artifacts. Section 3 of the manuscript describes the Gated SAE as having a dedicated gating branch (to which the L1 penalty is applied) and a separate magnitude branch. In the revised version we have expanded this section to state that the gate employs a continuous sigmoid approximation (with a fixed temperature parameter) rather than a hard threshold; this choice ensures end-to-end differentiability. Gradients flow through the gate via standard back-propagation. No auxiliary loss is introduced; the joint optimization of the reconstruction MSE together with the L1 penalty on the gate is sufficient to discourage mismatched directions between the two branches, because any such mismatch would increase reconstruction error and is therefore directly penalized. We have added pseudocode and a short discussion of this point. revision: yes

  2. Referee: [§4] §4 (Experiments): The reported result that Gated SAEs require half as many firing features for comparable reconstruction is central, yet the text provides no per-run standard deviations, no statistical significance tests, and no ablation on the L1 coefficient range. Without these, it is impossible to judge whether the factor-of-two improvement is stable or an artifact of the chosen hyper-parameter regime.

    Authors: We acknowledge that additional statistical reporting would improve confidence in the central empirical claim. Although the original experiments were performed across multiple random seeds, standard deviations and formal significance tests were omitted from the main text for conciseness. In the revision we have added error bars (one standard deviation across five independent runs) to the relevant figures in §4 and included a hyper-parameter ablation over a wide range of L1 coefficients in a new Appendix D. We also report paired t-test results confirming that the reduction in active features is statistically significant (p < 0.01) across the tested regimes. These additions demonstrate that the factor-of-two improvement is stable within the hyper-parameter ranges we consider. revision: yes

  3. Referee: [§4.3] §4.3 (Interpretability): The interpretability evaluation relies on a specific set of probes. The paper should report whether the gating mechanism systematically suppresses or amplifies features in directions orthogonal to those probes, as any such bias would not be detected by the current metrics yet would undermine the claim of comparable interpretability.

    Authors: This is a reasonable request for additional checks on possible undetected biases. Our interpretability results rely on both automated feature probes and human evaluations, which indicate that Gated SAEs remain comparably interpretable to standard SAEs. In the revised manuscript we have added a short analysis in §4.3 that examines cosine similarities and activation statistics for feature directions lying outside the probed set; this analysis finds no systematic suppression or amplification attributable to the gating mechanism. While an exhaustive search over all possible orthogonal directions is computationally prohibitive, the additional checks we report are consistent with the claim of preserved interpretability. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical validation of architectural improvement

full rationale

The paper proposes Gated SAEs to address shrinkage in standard SAEs by separating feature selection (gating branch) from magnitude estimation, applying the L1 penalty only to the former. All central claims—solving shrinkage, comparable interpretability, and better reconstruction efficiency—are supported by direct training experiments on LM activations (up to 7B parameters) and quantitative comparisons on reconstruction MSE, L0 sparsity, and interpretability probes. No equations reduce a result to a fitted parameter by construction, no predictions are statistically forced from inputs, and no load-bearing premise relies on self-citation chains or imported uniqueness theorems. The work is self-contained experimental methodology with external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Review limited to abstract; standard autoencoder reconstruction loss and sparsity assumptions are inherited from prior work. The main addition is the architectural split between gating and magnitude estimation.

free parameters (2)
  • L1 coefficient on gating branch
    Hyperparameter controlling sparsity that is now applied only to feature selection rather than to both selection and magnitude.
  • Standard training hyperparameters
    Learning rate, batch size, and other optimizer settings typical for neural network training of SAEs.
axioms (1)
  • domain assumption Sparse linear reconstruction of activations yields interpretable features
    Core premise of the SAE approach stated in the abstract.

pith-pipeline@v0.9.0 · 5498 in / 1192 out tokens · 48650 ms · 2026-05-17T19:36:14.845341+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.Jcost Jcost_symm echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage – systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions

  • Foundation.LawOfExistence defect_zero_iff_one echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  2. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.

  3. SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

    cs.LG 2026-05 unverdicted novelty 7.0

    SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.

  4. HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds

    cs.LG 2026-05 unverdicted novelty 7.0

    HH-SAE factorizes manifolds into nested contextual (L0), atomic (f1), and compository (f2) tiers, achieving 0.9156 cross-domain zero-shot AUC in fraud detection and +9.9% AUPRC lift in steered synthesis.

  5. Improving Sparse Autoencoder with Dynamic Attention

    cs.LG 2026-04 unverdicted novelty 7.0

    A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.

  6. MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

    cs.LG 2026-04 conditional novelty 7.0

    Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.

  7. Scaling and evaluating sparse autoencoders

    cs.LG 2024-06 unverdicted novelty 7.0

    K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

  8. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 conditional novelty 6.0

    DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.

  9. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 unverdicted novelty 6.0

    DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.

  10. Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer

    cs.LG 2026-05 unverdicted novelty 6.0

    Health foundation model embeddings contain an interpretable symbolic organization shared across modalities that supports cross-domain transfer without joint training.

  11. From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

    cs.AI 2026-05 conditional novelty 6.0

    Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.

  12. Feature Starvation as Geometric Instability in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...

  13. Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...

  14. Sparse Autoencoders as a Steering Basis for Phase Synchronization in Graph-Based CFD Surrogates

    cs.CE 2026-03 unverdicted novelty 6.0

    Sparse autoencoders enable phase synchronization in frozen graph CFD surrogates through Hilbert-identified oscillatory features and SVD-based time-varying rotations.

  15. MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

    cs.CV 2024-12 unverdicted novelty 6.0

    VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.

  16. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    cs.LG 2024-03 unverdicted novelty 6.0

    Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization ...

  17. Towards Effective Theory of LLMs: A Representation Learning Approach

    cs.LG 2026-05 unverdicted novelty 5.0

    RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

  18. Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

    cs.CL 2026-01 unverdicted novelty 5.0

    The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

Reference graph

Works this paper leans on

255 extracted references · 255 canonical work pages · cited by 16 Pith papers · 5 internal anchors

  1. [1]

    Rohan Anil, Sebastian Borgeaud, Jiecao Chen, Aakanksha Chowdhery, Jonathan Clark, et al

    M. Aharon, M. Elad, and A. Bruckstein. K-svd: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54 0 (11): 0 4311--4322, 2006. doi:10.1109/TSP.2006.881199

  2. [2]

    Introducing the next generation of Claude

    Anthropic AI . Introducing the next generation of Claude . https://www.anthropic.com/index/introducing-the-next-generation-of-claude, 2024. Accessed: 2024-04-14

  3. [3]

    Batson, B

    J. Batson, B. Chen, A. Jones, A. Templeton, T. Conerly, J. Marcus, T. Henighan, N. L. Turner, and A. Pearce. Circuits Updates - March 2024 . Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/mar-update/index.html

  4. [4]

    Y. Bengio. Deep learning of representations: Looking forward, 2013

  5. [5]

    Biderman, H

    S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397--2430. PMLR, 2023

  6. [6]

    Bills, N

    S. Bills, N. Cammarata, D. Mossing, H. Tillman, L. Gao, G. Goh, I. Sutskever, J. Leike, J. Wu, and W. Saunders. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023

  7. [7]

    J. Bloom. Open Source Sparse Autoencoders for all Residual Stream Layers of GPT-2 Small . https://www.alignmentforum.org/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream, 2024

  8. [8]

    Blumensath and M

    T. Blumensath and M. E. Davies. Gradient pursuits. IEEE Transactions on Signal Processing, 56 0 (6): 0 2370--2382, 2008

  9. [10]

    Bricken, A

    T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah. Towards monosemanticity: Decomposing language models with dictionary...

  10. [11]

    R. T. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. Advances in neural information processing systems, 31, 2018

  11. [12]

    X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems, 29, 2016

  12. [13]

    A. Conmy. My best guess at the important tricks for training 1L SAEs . https://www.lesswrong.com/posts/yJsLNWtmzcgPJgvro/my-best-guess-at-the-important-tricks-for-training-1l-saes, Dec 2023

  13. [14]

    Conmy, A

    A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability, 2023

  14. [15]

    Cunningham, A

    H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023

  15. [16]

    Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, page 933–941. JMLR.org, 2017

  16. [17]

    M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, New York, 2010. ISBN 978-1-4419-7010-7. doi:10.1007/978-1-4419-7011-4

  17. [18]

    Elhage, N

    N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. A mathematical framework for transformer circuits. Transformer Circui...

  18. [19]

    Elhage, T

    N. Elhage, T. Hume, C. Olsson, N. Nanda, T. Henighan, S. Johnston, S. ElShowk, N. Joseph, N. DasSarma, B. Mann, D. Hernandez, A. Askell, K. Ndousse, A. Jones, D. Drain, A. Chen, Y. Bai, D. Ganguli, L. Lovitt, Z. Hatfield-Dodds, J. Kernion, T. Conerly, S. Kravec, S. Fort, S. Kadavath, J. Jacobson, E. Tran-Johnson, J. Kaplan, J. Clark, T. Brown, S. McCandli...

  19. [20]

    Toy Models of Superposition

    N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, et al. Toy Models of Superposition . arXiv preprint arXiv:2209.10652, 2022 b

  20. [21]

    N. B. Erichson, Z. Yao, and M. W. Mahoney. Jumprelu: A retrofit defense strategy for adversarial attacks, 2019

  21. [22]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team . Gemini: A Family of Highly Capable Multimodal Models. Rohan Anil and Sebastian Borgeaud and Yonghui Wu and Jean-Baptiste Alayrac and Jiahui Yu and Radu Soricut and Johan Schalkwyk and Andrew M Dai and Anja Hauth et. al , 2024

  22. [23]

    Mesnard, C

    Gemma Team , T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, L. Sifre, M. Rivière, M. S. Kale, J. Love, P. Tafti, L. Hussenot, and et al. Gemma, 2024. URL https://www.kaggle.com/m/3301

  23. [24]

    Gurnee and M

    W. Gurnee and M. Tegmark. Language models represent space and time, 2024

  24. [25]

    Gurnee, N

    W. Gurnee, N. Nanda, M. Pauly, K. Harvey, D. Troitskii, and D. Bertsimas. Finding neurons in a haystack: Case studies with sparse probing, 2023

  25. [26]

    Hastie, R

    T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton, FL, 2015. ISBN 978-1-4987-1216-3. doi:10.1201/b18401

  26. [27]

    Kim and A

    H. Kim and A. Mnih. Disentangling by factorising. In International conference on machine learning, pages 2649--2658. PMLR, 2018

  27. [28]

    Kissane, R

    C. Kissane, R. Krzyzanowski, A. Conmy, and N. Nanda. Sparse autoencoders work on attention layer outputs. Alignment Forum, 2024 a . URL https://www.alignmentforum.org/posts/DtdzGwFh9dCfsekZZ

  28. [29]

    Kissane, R

    C. Kissane, R. Krzyzanowski, A. Conmy, and N. Nanda. Attention saes scale to gpt-2 small. Alignment Forum, 2024 b . URL https://www.alignmentforum.org/posts/FSTRedtjuHa4Gfdbr

  29. [30]

    Mallat and Z

    S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41 0 (12): 0 3397--3415, 1993. doi:10.1109/78.258082

  30. [31]

    Marks, C

    S. Marks, C. Rager, E. J. Michaud, Y. Belinkov, D. Bau, and A. Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models, 2024

  31. [32]

    Mathieu, T

    E. Mathieu, T. Rainforth, N. Siddharth, and Y. W. Teh. Disentangling disentanglement in variational autoencoders. In International conference on machine learning, pages 4402--4412. PMLR, 2019

  32. [33]

    McDougall

    C. McDougall. SAE Visualizer . https://github.com/callummcdougall/sae_vis, 2024

  33. [34]

    McDougall, A

    C. McDougall, A. Conmy, C. Rushing, T. McGrath, and N. Nanda. Copy suppression: Comprehensively understanding an attention head, 2023

  34. [35]

    N. Nanda. My Interpretability-Friendly Models (in TransformerLens) . https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J#z=NCJ6zH_Okw_mUYAwGnMKsj2m, 2022

  35. [36]

    N. Nanda. Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper , Oct 2023. URL https://www.alignmentforum.org/posts/aPTgTKC45dWvL9XBF/open-source-replication-and-commentary-on-anthropic-s

  36. [37]

    Nanda, L

    N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW

  37. [38]

    Nanda, A

    N. Nanda, A. Conmy, L. Smith, S. Rajamanoharan, T. Lieberum, J. Kramár, and V. Varma. [Summary] Progress Update \#1 from the GDM Mech Interp Team . Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/HpAr8k74mW4ivCvCu/summary-progress-update-1-from-the-gdm-mech-interp-team

  38. [39]

    A. Ng. Sparse autoencoder. http://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf, 2011. CS294A Lecture notes

  39. [40]

    C. Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. https://www.transformer-circuits.pub/2022/mech-interp-essay, 2022

  40. [41]

    C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter. Zoom in: An introduction to circuits. Distill, 2020. doi:10.23915/distill.00024.001

  41. [42]

    C. Olah, T. Bricken, J. Batson, A. Templeton, A. Jermyn, T. Hume, and T. Henighan. Circuits Updates - May 2023 . Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/may-update/index.html

  42. [43]

    C. Olah, S. Carter, A. Jermyn, J. Batson, T. Henighan, T. Conerly, J. Marcus, A. Templeton, B. Chen, and N. L. Turner. Circuits Updates - January 2024 . Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/jan-update/index.html

  43. [44]

    B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision Research, 37 0 (23): 0 3311--3325, 1997. doi:10.1016/S0042-6989(97)00169-7

  44. [45]

    Olsson, N

    C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, et al. In-context learning and induction heads, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

  45. [46]

    GPT-4 Technical Report , 2023

    OpenAI. GPT-4 Technical Report , 2023

  46. [47]

    K. Park, Y. J. Choe, and V. Veitch. The linear representation hypothesis and the geometry of large language models, 2023

  47. [48]

    Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, pages 40--44 vol.1, 1993. doi:10.1109/ACSSC.1993.342465

  48. [49]

    Sharkey, D

    L. Sharkey, D. Braun, and B. Millidge. [interim research report] taking features out of superposition with sparse autoencoders. https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition, 2022

  49. [51]

    Sundararajan, A

    M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 , volume 70 of Proceedings of Machine Learning Research, pages 3319--3328. PMLR, 2017. URL http://proceedings.mlr.press...

  50. [52]

    G. M. Taggart. Prolu: A nonlinearity for sparse autoencoders. https://www.lesswrong.com/posts/HEpufTdakGTTKgoYF/prolu-a-pareto-improvement-for-sparse-autoencoders, 2024

  51. [53]

    Tamkin, M

    A. Tamkin, M. Taufeeque, and N. D. Goodman. Codebook features: Sparse and discrete interpretability for neural networks, 2023

  52. [54]

    Templeton, J

    A. Templeton, J. Batson, T. Henighan, T. Conerly, J. Marcus, A. Golubeva, T. Bricken, and A. Jermyn. Circuits Updates - February 2024 . Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/feb-update/index.html

  53. [55]

    S. J. Thorpe. Local vs. distributed coding. Intellectica, 8: 0 3--40, 1989

  54. [56]

    , year 1996

    R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58 0 (1): 0 267--288, 1996. doi:10.1111/j.2517-6161.1996.tb02080.x

  55. [57]

    Tigges, O

    C. Tigges, O. J. Hollinsworth, A. Geiger, and N. Nanda. Linear representations of sentiment in large language models, 2023

  56. [58]

    A. M. Turner, L. Thiergart, D. Udell, G. Leech, U. Mini, and M. MacDiarmid. Activation addition: Steering language models without optimization, 2023

  57. [59]

    K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT -2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul

  58. [60]

    Wright and L

    B. Wright and L. Sharkey. Addressing feature suppression in saes. https://www.alignmentforum.org/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes, Feb 2024

  59. [61]

    Z. Yun, Y. Chen, B. A. Olshausen, and Y. LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors, 2023

  60. [62]

    Attention is All you Need , url =

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

  61. [63]

    Scaling Learning Algorithms Towards

    Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

  62. [64]

    and Osindero, Simon and Teh, Yee Whye , journal =

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

  63. [65]

    2016 , publisher=

    Deep learning , author=. 2016 , publisher=

  64. [66]

    2023 , eprint=

    Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

  65. [67]

    2019 , eprint=

    JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks , author=. 2019 , eprint=

  66. [68]

    2023 , eprint=

    Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. 2023 , eprint=

  67. [69]

    2023 , eprint=

    Activation Addition: Steering Language Models Without Optimization , author=. 2023 , eprint=

  68. [70]

    2023 , eprint=

    LEACE: Perfect linear concept erasure in closed form , author=. 2023 , eprint=

  69. [71]

    2023 , eprint=

    Language Models Implement Simple Word2Vec-style Vector Arithmetic , author=. 2023 , eprint=

  70. [72]

    Distill , year =

    Cammarata, Nick and Goh, Gabriel and Carter, Shan and Schubert, Ludwig and Petrov, Michael and Olah, Chris , title =. Distill , year =

  71. [73]

    2022 , eprint=

    Natural Language Descriptions of Deep Visual Features , author=. 2022 , eprint=

  72. [74]

    2015 , eprint=

    Visualizing and Understanding Recurrent Networks , author=. 2015 , eprint=

  73. [75]

    Distill , year =

    Cammarata, Nick and Goh, Gabriel and Carter, Shan and Voss, Chelsea and Schubert, Ludwig and Olah, Chris , title =. Distill , year =

  74. [76]

    Interpretability in the Wild: a Circuit for Indirect Object Identification in

    Kevin Ro Wang and Alexandre Variengien and Arthur Conmy and Buck Shlegeris and Jacob Steinhardt , booktitle =. Interpretability in the Wild: a Circuit for Indirect Object Identification in. 2023 , url =

  75. [77]

    2023 , eprint=

    The Hydra Effect: Emergent Self-repair in Language Model Computations , author=. 2023 , eprint=

  76. [78]

    A survey on neural network interpretability , volume =

    Zhang, Yu and Ti. A survey on neural network interpretability , volume =. 2021 , number =

  77. [79]

    2023 , eprint=

    Overthinking the Truth: Understanding how Language Models Process False Demonstrations , author=. 2023 , eprint=

  78. [80]

    2022 , title =

    Meng, Kevin and Bau, David and Andonian, Alex J and Belinkov, Yonatan , booktitle =. 2022 , title =

  79. [81]

    Distill , year =

    Olah, Chris and Cammarata, Nick and Schubert, Ludwig and Goh, Gabriel and Petrov, Michael and Carter, Shan , title =. Distill , year =

  80. [82]

    Distill , year =

    Olah, Chris and Mordvintsev, Alexander and Schubert, Ludwig , title =. Distill , year =

Showing first 80 references.