
arxiv: 2605.18629 · v1 · pith:5FYGF45Jnew · submitted 2026-05-18 · 💻 cs.LG

Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)

Pith reviewed 2026-05-20 12:11 UTC · model grok-4.3

classification 💻 cs.LG
keywords: sparse autoencoders · neural network interpretability · dead features · training stability · alignment score · reparameterization · dictionary learning · SAEBench

The pith

Reparameterizing sparse autoencoders to force the inner product of each encoder and decoder direction to equal one removes a source of training degeneracy and yields better features without new hyperparameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard SAE training produces a bimodal distribution of alignment scores between encoder and decoder weights for each feature, leaving many features poorly aligned. This misalignment is tied to dead features that stay inactive and to high variance across random seeds. The aligned training method reparameterizes the model so that every feature's encoder and decoder directions satisfy an inner-product constraint of exactly one. The resulting models show higher reconstruction fidelity, near-zero dead features, and greater stability on SAEBench across different models, dictionary sizes, and sparsity targets. The change adds no extra data, resampling steps, or tunable parameters and works alongside existing SAE improvements such as Top-K and p-annealing.

Core claim

The paper establishes that the overlooked bimodality in alignment scores (inner product of encoder and decoder directions) is a controllable source of degeneracy. By enforcing the geometric constraint that this inner product equals one for every feature through a simple reparameterization, the training dynamics are altered so that dead features disappear, reconstruction quality rises, and run-to-run stability improves, all without introducing hyperparameters or extra computational cost.
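In symbols, the constraint described above can be written as follows. The notation is assumed here for illustration: e_i and d_i stand for the encoder and decoder directions of feature i, m for the dictionary size, and a_i for the alignment score (the symbol a_i appears in the Figure 11 caption; the rest is not taken from the paper).

```latex
a_i \;=\; \langle e_i, d_i \rangle \;=\; e_i^{\top} d_i ,
\qquad
\text{aligned training enforces } a_i = 1 \quad \text{for all } i = 1, \dots, m .
```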

What carries the argument

The aligned training reparameterization, which directly constrains the encoder-decoder inner product to equal one for each feature and thereby fixes the geometric relationship between the learned directions.
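A minimal sketch of one way such a constraint can be built into the parameterization is shown below. The construction is an assumption made for illustration, not the paper's formula, which this page does not reproduce: the decoder is a normalised free vector, and the encoder is the decoder plus the part of a second free vector that is orthogonal to it. The names `aligned_pair`, `u`, and `v` are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def aligned_pair(u, v):
    """Map free parameters (u, v) to (encoder, decoder) with <encoder, decoder> = 1.

    Assumed construction (not taken from the paper):
      decoder = v / ||v||
      encoder = decoder + (u minus its projection onto decoder)
    The orthogonal part contributes nothing to the inner product, so the
    alignment score equals <decoder, decoder> = 1 for any u and nonzero v.
    """
    norm_v = math.sqrt(dot(v, v))
    decoder = [x / norm_v for x in v]
    proj = dot(u, decoder)
    encoder = [d + (x - proj * d) for x, d in zip(u, decoder)]
    return encoder, decoder


rng = random.Random(0)
u = [rng.gauss(0.0, 1.0) for _ in range(8)]
v = [rng.gauss(0.0, 1.0) for _ in range(8)]
encoder, decoder = aligned_pair(u, v)
alignment = dot(encoder, decoder)
```

Because the constraint holds for every value of the free parameters, a gradient step on `u` and `v` cannot move the feature off the constraint, and no penalty weight or projection schedule is needed. In this particular sketch, the scale of `v` and the component of `u` along `v` no longer affect the model.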

If this is right

  • SAEs trained with the constraint achieve Pareto improvements on reconstruction-versus-sparsity trade-offs.
  • Dead features are eliminated across multiple model families and sparsity regimes without resampling or auxiliary losses.
  • Feature sets become more stable across different random seeds, reducing the need for seed averaging.
  • The method composes directly with Top-K, BatchTop-K, and p-annealing architectures.
  • The same reparameterization applies at different dictionary sizes without retuning.
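Cross-seed stability, as in the third point above, needs a way to compare two learned dictionaries. The sketch below uses the mean, over features of one run, of the maximum cosine similarity to any feature of the other run. That this matches the stability metric used in the paper is an assumption; the function names and the toy dictionaries are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def mean_max_cosine(dict_a, dict_b):
    """Average, over features in dict_a, of the best cosine match in dict_b."""
    best = [max(cosine(a, b) for b in dict_b) for a in dict_a]
    return sum(best) / len(best)


# Toy decoder dictionaries from two hypothetical seeds (one feature per row).
seed_0 = [[1.0, 0.0], [0.0, 1.0]]
seed_1 = [[0.0, 2.0], [1.0, 1.0]]
stability = mean_max_cosine(seed_0, seed_1)
```

A value of 1 means every feature in the first run has an exact directional match in the second. The measure is not symmetric, so both directions are usually worth reporting.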

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same inner-product constraint could be tested in other overcomplete dictionary learning settings beyond SAEs.
  • Monitoring alignment scores during training might serve as an early diagnostic for whether a run will produce many dead features.
  • If the bimodality arises from gradient dynamics, similar geometric fixes might apply to related representation-learning methods.
  • Post-hoc feature pruning steps common in interpretability workflows could become less necessary.
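The diagnostic suggested in the second point above could be sketched as follows: compute the alignment score of every feature and report the share that falls in the low mode. The threshold of 0.5 and all function names are hypothetical choices for illustration; the paper does not specify a cutoff on this page.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def alignment_scores(encoder_rows, decoder_rows):
    """One score per feature: inner product of its encoder and decoder direction."""
    return [dot(e, d) for e, d in zip(encoder_rows, decoder_rows)]


def low_alignment_fraction(scores, threshold=0.5):
    """Share of features whose alignment score is below the (assumed) threshold."""
    return sum(s < threshold for s in scores) / len(scores)


# Toy weights for four features in a 2-dimensional activation space.
encoder_rows = [[1.0, 0.0], [0.2, 0.1], [0.0, 1.0], [0.1, -0.1]]
decoder_rows = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.8, 0.6]]

scores = alignment_scores(encoder_rows, decoder_rows)
flagged = low_alignment_fraction(scores)
```

Logging `flagged` at regular intervals during a standard (unconstrained) run would show whether the low-alignment mode forms early, which is the question the editorial extension raises.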

Load-bearing premise

The assumption that the observed bimodal alignment distribution is a fixable degeneracy whose removal does not prevent the SAE from accurately representing the original activations.

What would settle it

Run aligned training and standard training on the same activation dataset; if the aligned version still produces a substantial fraction of dead features or shows worse reconstruction loss than the baseline, the claim that the constraint removes the root degeneracy would be falsified.
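The dead-feature half of that comparison reduces to a simple count. The sketch below assumes a feature is "dead" if it never produces a positive activation on the evaluation set; the activation matrices are toy inputs made up for illustration, not results from the paper.

```python
def dead_feature_fraction(activations):
    """Fraction of features that never fire (activation > 0) on any sample.

    `activations` is a list of samples, each a list of per-feature activations.
    """
    n_features = len(activations[0])
    fired = [False] * n_features
    for sample in activations:
        for j, value in enumerate(sample):
            if value > 0.0:
                fired[j] = True
    return fired.count(False) / n_features


# Toy activations: 3 samples x 4 features for each hypothetical training run.
baseline_acts = [[0.0, 1.2, 0.0, 0.0], [0.3, 0.0, 0.0, 0.0], [0.0, 0.7, 0.0, 0.0]]
aligned_acts = [[0.0, 1.1, 0.2, 0.0], [0.4, 0.0, 0.0, 0.5], [0.0, 0.6, 0.0, 0.0]]

baseline_dead = dead_feature_fraction(baseline_acts)
aligned_dead = dead_feature_fraction(aligned_acts)
```

With these toy inputs the baseline fraction is 0.5 and the aligned fraction is 0.0. In a real test both runs would be evaluated on the same held-out activations, alongside reconstruction loss.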

Figures

Figures reproduced from arXiv: 2605.18629 by Michał Brzozowski, Neo Christopher Chung.

Figure 1: Geometric interpretation of aligned training for a single feature
Figure 2: Aligned training improves recovered cross-entropy across different sparsity levels. …
Figure 3: Aligned training improves TopK and BatchTopK autoencoders in the low-sparsity regime.
Figure 4: Aligned training reduces dead features to near zero without resampling or auxiliary losses.
Figure 5: The dead-feature reduction extends to TopK and BatchTopK. Dictionary size 65K, layer 12 …
Figure 6: Aligned training significantly improves cross-seed stability for both ReLU and TopK …
Figure 7: Reconstruction metrics for Pythia 160M (layer 8), dictionary size 4096, 3 random seeds.
Figure 8: Alive-feature fraction for Pythia 160M (layer 8), dictionary size 4096.
Figure 9: SCR metric from SAEBench, dictionary size 65K, Pythia 160M and Gemma 2 2B.
Figure 10: Bimodality of SAE alignment scores across different models and architectures.
Figure 11: MCS vs. alignment score (Pearson r = 0.65). The red vertical line marks a_i = 1. Adjacent text from Section C.3, "Alignment Scores Are Correlated with Autointerpretability": The alignment score is positively correlated with autointerpretability (Pearson r = 0.32; …
Figure 12: Autointerpretability vs. alignment score (Pearson …
Figure 13: Reconstruction metrics for Pythia 160M (layer 8) and Gemma 2 2B (layer 12), dictionary …
Figure 14: Weight tying reduces dead features but at the cost of reconstruction quality. Pythia 160M …
Figure 15: Reconstruction metrics, dictionary size 16384, Pythia 160M (layer 8) and Gemma 2 2B …
Figure 16: Alive-feature fraction, dictionary size 16384.
Figure 17: Reconstruction metrics, dictionary size 65K.
Figure 18: Alive-feature fraction, dictionary size 65K.
Figure 19: Reconstruction metrics at 500M tokens, dictionary size 65K, Gemma 2 2B (layer 12).
Figure 20: Alive-feature fraction at 500M tokens, dictionary size 65K, Gemma 2 2B (layer 12).

Original abstract

Sparse autoencoders (SAEs) are one of the main methods to interpret the inner workings of deep neural networks (DNNs), decomposing activations into higher-dimensional features. However, they exhibit critical shortcomings: a large fraction of features are never activated, and the learned features are unstable. Variants of SAEs that attempt to mitigate these issues require additional data, resampling, or training. We propose aligned training, a parameter-free reparameterization of SAEs that simultaneously improves reconstruction quality, eliminates dead features, and significantly enhances stability across training seeds. Our approach is motivated by an overlooked observation that SAE feature quality, measured by the inner product between encoder and decoder directions (which we call the alignment score), follows a bimodal distribution across all modern architectures. The proposed aligned training enforces a geometric constraint between the encoder and decoder such that their inner product equals one for every feature, which removes a source of degeneracy in SAE training without adding any hyperparameters. Across multiple models, dictionary sizes, and sparsity levels, aligned training shows Pareto improvements on the SAEBench benchmarks. Beyond improving dead features, stability, and reconstruction, our method readily integrates with techniques in mechanistic interpretability such as Top/BatchTop-K architectures and p-Annealing. Overall, aligned training substantially improves the feature quality and stability of SAEs without added computational complexity or cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes aligned training, a parameter-free reparameterization of sparse autoencoders (SAEs) that enforces the inner product between encoder and decoder directions to equal exactly one for each feature. Motivated by an observed bimodal distribution of alignment scores, the method claims to remove a source of degeneracy in SAE training. It reports simultaneous improvements in reconstruction quality, elimination of dead features, and enhanced stability across training seeds, along with Pareto improvements on SAEBench benchmarks across models, dictionary sizes, and sparsity levels. The approach integrates with techniques such as Top-K and p-Annealing without added hyperparameters or computational cost.

Significance. If the central claim holds—that the hard geometric constraint directly fixes a degeneracy rather than merely altering optimization dynamics—this would represent a simple, hyperparameter-free improvement to a widely used tool in mechanistic interpretability. The reported compatibility with existing SAE variants and the absence of new hyperparameters are practical strengths that could facilitate adoption if the gains prove robust and mechanistically attributable to the alignment enforcement.

major comments (2)
  1. [Method] Method section (reparameterization description): The aligned training ties the decoder direction to the encoder such that their inner product is fixed at 1, which necessarily reduces the number of independent parameters relative to the standard untied SAE formulation. The paper attributes observed gains to removal of the low-alignment mode in the bimodal distribution, yet no ablation is described that enforces the same alignment=1 constraint via a soft penalty or post-update projection while preserving the original untied parameterization. Without this comparison, it remains unclear whether improvements stem from the claimed geometric degeneracy fix or from changes in gradient flow and effective degrees of freedom.
  2. [Experiments] Experiments and results sections: The central claim of Pareto improvements on SAEBench (reconstruction, dead features, stability) is load-bearing, but the manuscript does not report an explicit test of whether forcing alignment=1 compromises the SAE's ability to represent the underlying data distribution (e.g., via held-out reconstruction error or feature activation statistics under the constraint). The weakest assumption—that the bimodal distribution represents a fixable degeneracy rather than a natural outcome of optimization—requires direct empirical support through such a comparison.
minor comments (1)
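  1. [Abstract] The abstract claims that the bimodal alignment distribution holds 'across all modern architectures' and reports improvements 'across multiple models, dictionary sizes, and sparsity levels' without listing the specific models, layers, or datasets used; adding this detail in the introduction or experimental setup would improve clarity.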
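
The ablation requested in major comment 1 keeps the untied parameterization and imposes the constraint from outside. The two options the referee names could look roughly like this. Both functions are illustrative sketches, not code from the paper; `weight` is a hypothetical penalty coefficient, which is exactly the kind of extra hyperparameter the hard reparameterization avoids.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_encoder(encoder, decoder):
    """Post-update projection: smallest Euclidean change to `encoder`
    that restores <encoder, decoder> = 1, leaving `decoder` untouched."""
    gap = 1.0 - dot(encoder, decoder)
    scale = gap / dot(decoder, decoder)
    return [e + scale * d for e, d in zip(encoder, decoder)]


def alignment_penalty(encoder, decoder, weight):
    """Soft alternative: quadratic penalty added to the training loss."""
    return weight * (dot(encoder, decoder) - 1.0) ** 2


# Toy untied pair with alignment score 0.2.
encoder = [0.2, 0.1]
decoder = [0.6, 0.8]

projected = project_encoder(encoder, decoder)
penalty_before = alignment_penalty(encoder, decoder, weight=0.1)
penalty_after = alignment_penalty(projected, decoder, weight=0.1)
```

If either variant reproduced the reported gains, that would point to the constraint itself as the cause; if only the reparameterized model did, changes in gradient flow or effective degrees of freedom would remain a live explanation.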

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, offering clarifications on the method and experiments while indicating revisions that will strengthen the manuscript.

  1. Referee: [Method] Method section (reparameterization description): The aligned training ties the decoder direction to the encoder such that their inner product is fixed at 1, which necessarily reduces the number of independent parameters relative to the standard untied SAE formulation. The paper attributes observed gains to removal of the low-alignment mode in the bimodal distribution, yet no ablation is described that enforces the same alignment=1 constraint via a soft penalty or post-update projection while preserving the original untied parameterization. Without this comparison, it remains unclear whether improvements stem from the claimed geometric degeneracy fix or from changes in gradient flow and effective degrees of freedom.

    Authors: We acknowledge that the reparameterization reduces the number of independent parameters by design, as this is the mechanism by which the unit inner product is strictly enforced. Our central claim is that this hard geometric constraint directly eliminates the low-alignment mode observed in the bimodal distribution, rather than merely altering optimization dynamics. A soft penalty or post-hoc projection would require an additional hyperparameter (e.g., penalty weight or projection frequency), which would violate the parameter-free property of the method. We will revise the method section to explicitly discuss the relationship between the hard constraint, parameter count, and the observed degeneracy, including a clearer justification for preferring the reparameterization over soft alternatives. revision: partial

  2. Referee: [Experiments] Experiments and results sections: The central claim of Pareto improvements on SAEBench (reconstruction, dead features, stability) is load-bearing, but the manuscript does not report an explicit test of whether forcing alignment=1 compromises the SAE's ability to represent the underlying data distribution (e.g., via held-out reconstruction error or feature activation statistics under the constraint). The weakest assumption—that the bimodal distribution represents a fixable degeneracy rather than a natural outcome of optimization—requires direct empirical support through such a comparison.

    Authors: The reported Pareto improvements on SAEBench already include enhanced reconstruction quality across multiple settings, which is measured on data not used for training and thus provides indirect evidence that the constraint does not harm the ability to represent the data distribution. To directly address the concern, we will add an explicit comparison of held-out reconstruction error and feature activation statistics between aligned and standard SAEs in the revised experiments section. This addition will supply the requested empirical support for interpreting the bimodal distribution as a fixable degeneracy. revision: yes

Circularity Check

0 steps flagged

No circularity: aligned training is a direct reparameterization with empirical validation


The paper introduces aligned training as a parameter-free reparameterization that directly enforces the encoder-decoder inner product to equal 1 for each feature. This is motivated by an observed bimodal distribution of alignment scores but does not derive any result or prediction from fitted parameters or prior outputs. The claimed Pareto improvements on SAEBench are presented as empirical outcomes across models and settings, not as quantities that reduce to the constraint by construction. No self-citation chain, uniqueness theorem, or ansatz smuggling supports the central mechanism; the approach is self-contained as an engineering change to the SAE parameterization without load-bearing external citations or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method relies on the domain assumption about the alignment score distribution and the benefit of enforcing the constraint.

axioms (1)
  • domain assumption: SAE feature quality, measured by the inner product between encoder and decoder directions, follows a bimodal distribution.
    This observation motivates the method; the paper describes it as overlooked and as holding across modern architectures.



