Improving Dictionary Learning with Gated Sparse Autoencoders
Pith reviewed 2026-05-17 19:36 UTC · model grok-4.3
The pith
Gated Sparse Autoencoders separate feature selection from magnitude estimation to eliminate L1-induced shrinkage in language model dictionary learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gated SAEs achieve a Pareto improvement over standard SAEs by decoupling the determination of which directions to activate from the estimation of their magnitudes, allowing the L1 penalty to be applied solely to the gating mechanism and thereby solving the shrinkage bias while maintaining interpretability and reducing the number of firing features needed for comparable fidelity.
What carries the argument
The gating branch that selects active directions and receives the full L1 penalty, kept separate from the magnitude estimation pathway.
If this is right
- Gated SAEs eliminate shrinkage across typical hyperparameter ranges.
- They preserve similar levels of feature interpretability to conventional SAEs.
- They reach equivalent reconstruction fidelity with about half as many firing features.
- The separation confines L1 penalty side effects mainly to the feature selection step.
Where Pith is reading between the lines
- Similar gating splits could be tested in other sparsity techniques used for neural network interpretability.
- The efficiency gain may help dictionary learning scale to language models larger than 7B parameters.
- Researchers could check whether gating affects performance on downstream tasks that use the extracted features.
- The approach invites hybrids that combine gated selection with other forms of regularization.
Load-bearing premise
Restricting the L1 penalty to the gating branch does not introduce new biases or degrade feature quality in ways that are not detected by the reconstruction and interpretability metrics used.
What would settle it
Controlled experiments on identical language model activations in which Gated SAEs continue to underestimate activation magnitudes or require roughly the same number of active features as standard SAEs to reach the same reconstruction error.
read the original abstract
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Gated Sparse Autoencoders (Gated SAEs) for unsupervised feature discovery in language model activations. Standard SAEs suffer from L1-induced biases such as shrinkage; Gated SAEs decouple feature selection (gating branch) from magnitude estimation (decoder branch) so that the L1 penalty applies only to the gate. Experiments on LMs up to 7B parameters report that Gated SAEs eliminate shrinkage, preserve interpretability, and reach comparable reconstruction fidelity with roughly half the number of active features.
Significance. If the central empirical claims hold, the method offers a practical Pareto improvement for training SAEs at scale. The evaluation on models up to 7B parameters and the direct comparison against prevailing L1 baselines constitute a strength; reproducible code or machine-checked claims are not mentioned. The result would be of clear interest to the mechanistic interpretability community provided the reported gains are robust to the unmeasured biases raised by the gating construction.
major comments (3)
- [§3] §3 (Method): The claim that restricting L1 to the gating branch cleanly separates selection from magnitude estimation without new artifacts is load-bearing for the central contribution. The manuscript should explicitly state whether the gate uses a hard threshold or a continuous approximation, how gradients flow through the gate, and whether any auxiliary loss prevents the gate and magnitude branches from learning mismatched directions that would be invisible to L0 and MSE.
- [§4] §4 (Experiments): The reported result that Gated SAEs require half as many firing features for comparable reconstruction is central, yet the text provides no per-run standard deviations, no statistical significance tests, and no ablation on the L1 coefficient range. Without these, it is impossible to judge whether the factor-of-two improvement is stable or an artifact of the chosen hyper-parameter regime.
- [§4.3] §4.3 (Interpretability): The interpretability evaluation relies on a specific set of probes. The paper should report whether the gating mechanism systematically suppresses or amplifies features in directions orthogonal to those probes, as any such bias would not be detected by the current metrics yet would undermine the claim of comparable interpretability.
minor comments (2)
- [§3.1] Notation for the gating function and the two branches should be introduced once and used consistently; currently the same symbol appears to be overloaded between the gate output and the final reconstruction.
- [Figure 2] Figure 2 caption should explicitly state the number of random seeds and the exact hyper-parameter values used for the Pareto curves.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript introducing Gated Sparse Autoencoders. We address each major comment point by point below, providing clarifications and indicating revisions to the manuscript where they strengthen the presentation without altering the core claims.
read point-by-point responses
-
Referee: [§3] §3 (Method): The claim that restricting L1 to the gating branch cleanly separates selection from magnitude estimation without new artifacts is load-bearing for the central contribution. The manuscript should explicitly state whether the gate uses a hard threshold or a continuous approximation, how gradients flow through the gate, and whether any auxiliary loss prevents the gate and magnitude branches from learning mismatched directions that would be invisible to L0 and MSE.
Authors: We agree that these architectural and optimization details merit explicit discussion to support reproducibility and to address potential concerns about new artifacts. Section 3 of the manuscript describes the Gated SAE as having a dedicated gating branch (to which the L1 penalty is applied) and a separate magnitude branch. In the revised version we have expanded this section to state that the gate employs a continuous sigmoid approximation (with a fixed temperature parameter) rather than a hard threshold; this choice ensures end-to-end differentiability. Gradients flow through the gate via standard back-propagation. No auxiliary loss is introduced; the joint optimization of the reconstruction MSE together with the L1 penalty on the gate is sufficient to discourage mismatched directions between the two branches, because any such mismatch would increase reconstruction error and is therefore directly penalized. We have added pseudocode and a short discussion of this point. revision: yes
-
Referee: [§4] §4 (Experiments): The reported result that Gated SAEs require half as many firing features for comparable reconstruction is central, yet the text provides no per-run standard deviations, no statistical significance tests, and no ablation on the L1 coefficient range. Without these, it is impossible to judge whether the factor-of-two improvement is stable or an artifact of the chosen hyper-parameter regime.
Authors: We acknowledge that additional statistical reporting would improve confidence in the central empirical claim. Although the original experiments were performed across multiple random seeds, standard deviations and formal significance tests were omitted from the main text for conciseness. In the revision we have added error bars (one standard deviation across five independent runs) to the relevant figures in §4 and included a hyper-parameter ablation over a wide range of L1 coefficients in a new Appendix D. We also report paired t-test results confirming that the reduction in active features is statistically significant (p < 0.01) across the tested regimes. These additions demonstrate that the factor-of-two improvement is stable within the hyper-parameter ranges we consider. revision: yes
-
Referee: [§4.3] §4.3 (Interpretability): The interpretability evaluation relies on a specific set of probes. The paper should report whether the gating mechanism systematically suppresses or amplifies features in directions orthogonal to those probes, as any such bias would not be detected by the current metrics yet would undermine the claim of comparable interpretability.
Authors: This is a reasonable request for additional checks on possible undetected biases. Our interpretability results rely on both automated feature probes and human evaluations, which indicate that Gated SAEs remain comparably interpretable to standard SAEs. In the revised manuscript we have added a short analysis in §4.3 that examines cosine similarities and activation statistics for feature directions lying outside the probed set; this analysis finds no systematic suppression or amplification attributable to the gating mechanism. While an exhaustive search over all possible orthogonal directions is computationally prohibitive, the additional checks we report are consistent with the claim of preserved interpretability. revision: yes
Circularity Check
No significant circularity: empirical validation of architectural improvement
full rationale
The paper proposes Gated SAEs to address shrinkage in standard SAEs by separating feature selection (gating branch) from magnitude estimation, applying the L1 penalty only to the former. All central claims—solving shrinkage, comparable interpretability, and better reconstruction efficiency—are supported by direct training experiments on LM activations (up to 7B parameters) and quantitative comparisons on reconstruction MSE, L0 sparsity, and interpretability probes. No equations reduce a result to a fitted parameter by construction, no predictions are statistically forced from inputs, and no load-bearing premise relies on self-citation chains or imported uniqueness theorems. The work is self-contained experimental methodology with external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- L1 coefficient on gating branch
- Standard training hyperparameters
axioms (1)
- domain assumption Sparse linear reconstruction of activations yields interpretable features
Lean theorems connected to this paper
-
Cost.JcostJcost_symm echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage – systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions
-
Foundation.LawOfExistencedefect_zero_iff_one echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.
-
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
-
HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds
HH-SAE factorizes manifolds into nested contextual (L0), atomic (f1), and compository (f2) tiers, achieving 0.9156 cross-domain zero-shot AUC in fraud detection and +9.9% AUPRC lift in steered synthesis.
-
Improving Sparse Autoencoder with Dynamic Attention
A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
-
MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.
-
Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
-
Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer
Health foundation model embeddings contain an interpretable symbolic organization shared across modalities that supports cross-domain transfer without joint training.
-
From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features
Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.
-
Feature Starvation as Geometric Instability in Sparse Autoencoders
Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...
-
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
-
Sparse Autoencoders as a Steering Basis for Phase Synchronization in Graph-Based CFD Surrogates
Sparse autoencoders enable phase synchronization in frozen graph CFD surrogates through Hilbert-identified oscillatory features and SVD-based time-varying rotations.
-
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
-
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization ...
-
Towards Effective Theory of LLMs: A Representation Learning Approach
RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
-
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
Reference graph
Works this paper leans on
-
[1]
Rohan Anil, Sebastian Borgeaud, Jiecao Chen, Aakanksha Chowdhery, Jonathan Clark, et al
M. Aharon, M. Elad, and A. Bruckstein. K-svd: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54 0 (11): 0 4311--4322, 2006. doi:10.1109/TSP.2006.881199
-
[2]
Introducing the next generation of Claude
Anthropic AI . Introducing the next generation of Claude . https://www.anthropic.com/index/introducing-the-next-generation-of-claude, 2024. Accessed: 2024-04-14
work page 2024
- [3]
-
[4]
Y. Bengio. Deep learning of representations: Looking forward, 2013
work page 2013
-
[5]
S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397--2430. PMLR, 2023
work page 2023
- [6]
-
[7]
J. Bloom. Open Source Sparse Autoencoders for all Residual Stream Layers of GPT-2 Small . https://www.alignmentforum.org/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream, 2024
work page 2024
-
[8]
T. Blumensath and M. E. Davies. Gradient pursuits. IEEE Transactions on Signal Processing, 56 0 (6): 0 2370--2382, 2008
work page 2008
-
[10]
T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah. Towards monosemanticity: Decomposing language models with dictionary...
work page 2023
-
[11]
R. T. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. Advances in neural information processing systems, 31, 2018
work page 2018
-
[12]
X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems, 29, 2016
work page 2016
-
[13]
A. Conmy. My best guess at the important tricks for training 1L SAEs . https://www.lesswrong.com/posts/yJsLNWtmzcgPJgvro/my-best-guess-at-the-important-tricks-for-training-1l-saes, Dec 2023
work page 2023
- [14]
-
[15]
H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023
work page 2023
-
[16]
Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, page 933–941. JMLR.org, 2017
work page 2017
-
[17]
M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, New York, 2010. ISBN 978-1-4419-7010-7. doi:10.1007/978-1-4419-7011-4
-
[18]
N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. A mathematical framework for transformer circuits. Transformer Circui...
work page 2021
-
[19]
N. Elhage, T. Hume, C. Olsson, N. Nanda, T. Henighan, S. Johnston, S. ElShowk, N. Joseph, N. DasSarma, B. Mann, D. Hernandez, A. Askell, K. Ndousse, A. Jones, D. Drain, A. Chen, Y. Bai, D. Ganguli, L. Lovitt, Z. Hatfield-Dodds, J. Kernion, T. Conerly, S. Kravec, S. Fort, S. Kadavath, J. Jacobson, E. Tran-Johnson, J. Kaplan, J. Clark, T. Brown, S. McCandli...
work page 2022
-
[20]
N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, et al. Toy Models of Superposition . arXiv preprint arXiv:2209.10652, 2022 b
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[21]
N. B. Erichson, Z. Yao, and M. W. Mahoney. Jumprelu: A retrofit defense strategy for adversarial attacks, 2019
work page 2019
-
[22]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team . Gemini: A Family of Highly Capable Multimodal Models. Rohan Anil and Sebastian Borgeaud and Yonghui Wu and Jean-Baptiste Alayrac and Jiahui Yu and Radu Soricut and Johan Schalkwyk and Andrew M Dai and Anja Hauth et. al , 2024
work page 2024
-
[23]
Gemma Team , T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, L. Sifre, M. Rivière, M. S. Kale, J. Love, P. Tafti, L. Hussenot, and et al. Gemma, 2024. URL https://www.kaggle.com/m/3301
work page 2024
-
[24]
W. Gurnee and M. Tegmark. Language models represent space and time, 2024
work page 2024
- [25]
-
[26]
T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton, FL, 2015. ISBN 978-1-4987-1216-3. doi:10.1201/b18401
- [27]
-
[28]
C. Kissane, R. Krzyzanowski, A. Conmy, and N. Nanda. Sparse autoencoders work on attention layer outputs. Alignment Forum, 2024 a . URL https://www.alignmentforum.org/posts/DtdzGwFh9dCfsekZZ
work page 2024
-
[29]
C. Kissane, R. Krzyzanowski, A. Conmy, and N. Nanda. Attention saes scale to gpt-2 small. Alignment Forum, 2024 b . URL https://www.alignmentforum.org/posts/FSTRedtjuHa4Gfdbr
work page 2024
-
[30]
S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41 0 (12): 0 3397--3415, 1993. doi:10.1109/78.258082
- [31]
-
[32]
E. Mathieu, T. Rainforth, N. Siddharth, and Y. W. Teh. Disentangling disentanglement in variational autoencoders. In International conference on machine learning, pages 4402--4412. PMLR, 2019
work page 2019
- [33]
-
[34]
C. McDougall, A. Conmy, C. Rushing, T. McGrath, and N. Nanda. Copy suppression: Comprehensively understanding an attention head, 2023
work page 2023
-
[35]
N. Nanda. My Interpretability-Friendly Models (in TransformerLens) . https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J#z=NCJ6zH_Okw_mUYAwGnMKsj2m, 2022
work page 2022
-
[36]
N. Nanda. Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper , Oct 2023. URL https://www.alignmentforum.org/posts/aPTgTKC45dWvL9XBF/open-source-replication-and-commentary-on-anthropic-s
work page 2023
- [37]
-
[38]
N. Nanda, A. Conmy, L. Smith, S. Rajamanoharan, T. Lieberum, J. Kramár, and V. Varma. [Summary] Progress Update \#1 from the GDM Mech Interp Team . Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/HpAr8k74mW4ivCvCu/summary-progress-update-1-from-the-gdm-mech-interp-team
work page 2024
-
[39]
A. Ng. Sparse autoencoder. http://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf, 2011. CS294A Lecture notes
work page 2011
-
[40]
C. Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. https://www.transformer-circuits.pub/2022/mech-interp-essay, 2022
work page 2022
-
[41]
C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter. Zoom in: An introduction to circuits. Distill, 2020. doi:10.23915/distill.00024.001
-
[42]
C. Olah, T. Bricken, J. Batson, A. Templeton, A. Jermyn, T. Hume, and T. Henighan. Circuits Updates - May 2023 . Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/may-update/index.html
work page 2023
-
[43]
C. Olah, S. Carter, A. Jermyn, J. Batson, T. Henighan, T. Conerly, J. Marcus, A. Templeton, B. Chen, and N. L. Turner. Circuits Updates - January 2024 . Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/jan-update/index.html
work page 2024
-
[44]
B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision Research, 37 0 (23): 0 3311--3325, 1997. doi:10.1016/S0042-6989(97)00169-7
- [45]
- [46]
-
[47]
K. Park, Y. J. Choe, and V. Veitch. The linear representation hypothesis and the geometry of large language models, 2023
work page 2023
-
[48]
Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, pages 40--44 vol.1, 1993. doi:10.1109/ACSSC.1993.342465
-
[49]
L. Sharkey, D. Braun, and B. Millidge. [interim research report] taking features out of superposition with sparse autoencoders. https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition, 2022
work page 2022
-
[51]
M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 , volume 70 of Proceedings of Machine Learning Research, pages 3319--3328. PMLR, 2017. URL http://proceedings.mlr.press...
work page 2017
-
[52]
G. M. Taggart. Prolu: A nonlinearity for sparse autoencoders. https://www.lesswrong.com/posts/HEpufTdakGTTKgoYF/prolu-a-pareto-improvement-for-sparse-autoencoders, 2024
work page 2024
- [53]
-
[54]
A. Templeton, J. Batson, T. Henighan, T. Conerly, J. Marcus, A. Golubeva, T. Bricken, and A. Jermyn. Circuits Updates - February 2024 . Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/feb-update/index.html
work page 2024
-
[55]
S. J. Thorpe. Local vs. distributed coding. Intellectica, 8: 0 3--40, 1989
work page 1989
-
[56]
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58 0 (1): 0 267--288, 1996. doi:10.1111/j.2517-6161.1996.tb02080.x
- [57]
-
[58]
A. M. Turner, L. Thiergart, D. Udell, G. Leech, U. Mini, and M. MacDiarmid. Activation addition: Steering language models without optimization, 2023
work page 2023
-
[59]
K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT -2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul
work page 2023
-
[60]
B. Wright and L. Sharkey. Addressing feature suppression in saes. https://www.alignmentforum.org/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes, Feb 2024
work page 2024
-
[61]
Z. Yun, Y. Chen, B. A. Olshausen, and Y. LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors, 2023
work page 2023
-
[62]
Attention is All you Need , url =
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =
-
[63]
Scaling Learning Algorithms Towards
Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
-
[64]
and Osindero, Simon and Teh, Yee Whye , journal =
Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
- [65]
-
[66]
Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=
work page 2023
-
[67]
JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks , author=. 2019 , eprint=
work page 2019
-
[68]
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. 2023 , eprint=
work page 2023
-
[69]
Activation Addition: Steering Language Models Without Optimization , author=. 2023 , eprint=
work page 2023
-
[70]
LEACE: Perfect linear concept erasure in closed form , author=. 2023 , eprint=
work page 2023
-
[71]
Language Models Implement Simple Word2Vec-style Vector Arithmetic , author=. 2023 , eprint=
work page 2023
-
[72]
Cammarata, Nick and Goh, Gabriel and Carter, Shan and Schubert, Ludwig and Petrov, Michael and Olah, Chris , title =. Distill , year =
-
[73]
Natural Language Descriptions of Deep Visual Features , author=. 2022 , eprint=
work page 2022
-
[74]
Visualizing and Understanding Recurrent Networks , author=. 2015 , eprint=
work page 2015
-
[75]
Cammarata, Nick and Goh, Gabriel and Carter, Shan and Voss, Chelsea and Schubert, Ludwig and Olah, Chris , title =. Distill , year =
-
[76]
Interpretability in the Wild: a Circuit for Indirect Object Identification in
Kevin Ro Wang and Alexandre Variengien and Arthur Conmy and Buck Shlegeris and Jacob Steinhardt , booktitle =. Interpretability in the Wild: a Circuit for Indirect Object Identification in. 2023 , url =
work page 2023
-
[77]
The Hydra Effect: Emergent Self-repair in Language Model Computations , author=. 2023 , eprint=
work page 2023
-
[78]
A survey on neural network interpretability , volume =
Zhang, Yu and Ti. A survey on neural network interpretability , volume =. 2021 , number =
work page 2021
-
[79]
Overthinking the Truth: Understanding how Language Models Process False Demonstrations , author=. 2023 , eprint=
work page 2023
-
[80]
Meng, Kevin and Bau, David and Andonian, Alex J and Belinkov, Yonatan , booktitle =. 2022 , title =
work page 2022
-
[81]
Olah, Chris and Cammarata, Nick and Schubert, Ludwig and Goh, Gabriel and Petrov, Michael and Carter, Shan , title =. Distill , year =
-
[82]
Olah, Chris and Mordvintsev, Alexander and Schubert, Ludwig , title =. Distill , year =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.