Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)

Michał Brzozowski; Neo Christopher Chung

arxiv: 2605.18629 · v1 · pith:5FYGF45Jnew · submitted 2026-05-18 · 💻 cs.LG

Pith reviewed 2026-05-20 12:11 UTC · model grok-4.3

classification 💻 cs.LG

keywords sparse autoencoders · neural network interpretability · dead features · training stability · alignment score · reparameterization · dictionary learning · SAEBench

The pith

Reparameterizing sparse autoencoders to force the inner product of each encoder and decoder direction to equal one removes a source of training degeneracy and yields better features without new hyperparameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard SAE training produces a bimodal distribution of alignment scores between encoder and decoder weights for each feature, leaving many features poorly aligned. This misalignment is tied to dead features that stay inactive and to high variance across random seeds. The aligned training method reparameterizes the model so that every feature's encoder and decoder directions satisfy an inner-product constraint of exactly one. The resulting models show higher reconstruction fidelity, near-zero dead features, and greater stability on SAEBench across different models, dictionary sizes, and sparsity targets. The change adds no extra data, resampling steps, or tunable parameters and works alongside existing SAE improvements such as Top-K and p-annealing.

Core claim

The paper establishes that the overlooked bimodality in alignment scores (inner product of encoder and decoder directions) is a controllable source of degeneracy. By enforcing the geometric constraint that this inner product equals one for every feature through a simple reparameterization, the training dynamics are altered so that dead features disappear, reconstruction quality rises, and run-to-run stability improves, all without introducing hyperparameters or extra computational cost.

What carries the argument

The aligned training reparameterization, which directly constrains the encoder-decoder inner product to equal one for each feature and thereby fixes the geometric relationship between the learned directions.
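
A minimal sketch of one way such a constraint could be enforced by construction, assuming NumPy and an untied SAE with encoder rows `w_enc[i]` and decoder columns `w_dec[:, i]`. The free parameter `v`, the function names, and the choice to rescale the decoder rather than the encoder are illustrative assumptions, not the authors' stated parameterization.

```python
import numpy as np


def alignment_scores(w_enc: np.ndarray, w_dec: np.ndarray) -> np.ndarray:
    """Per-feature inner product <w_enc[i], w_dec[:, i]>."""
    return np.einsum("id,di->i", w_enc, w_dec)


def aligned_decoder(w_enc: np.ndarray, v: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Rescale hypothetical free decoder parameters `v` so every score equals 1.

    w_enc: (n_features, d_model) encoder rows.
    v:     (d_model, n_features) unconstrained decoder parameters.
    """
    inner = alignment_scores(w_enc, v)
    # Assumption: clamp near-zero inner products to avoid dividing by zero;
    # for those features the constraint is not met exactly.
    safe = np.where(np.abs(inner) < eps, np.copysign(eps, inner), inner)
    return v / safe  # broadcasts over the d_model axis


rng = np.random.default_rng(0)
w_enc = rng.normal(size=(8, 4))  # 8 features, 4-dimensional activations
v = rng.normal(size=(4, 8))
w_dec = aligned_decoder(w_enc, v)
scores = alignment_scores(w_enc, w_dec)  # all entries are 1 up to rounding
```

In a gradient-based framework the same rescaling would presumably sit inside the forward pass, so that gradients reach both `v` and `w_enc`; that placement is also an assumption here.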

If this is right

SAEs trained with the constraint achieve Pareto improvements on reconstruction-versus-sparsity trade-offs.
Dead features are eliminated across multiple model families and sparsity regimes without resampling or auxiliary losses.
Feature sets become more stable across different random seeds, reducing the need for seed averaging.
The method composes directly with Top-K, BatchTop-K, and p-annealing architectures.
The same reparameterization applies at different dictionary sizes without retuning.
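
One common way to quantify the cross-seed stability in the list above is the mean max cosine similarity (MCS) between the decoder dictionaries of two runs. The sketch below assumes NumPy and decoder matrices of shape `(d_model, n_features)`; the function name and this exact definition of MCS are assumptions and may differ from the metric the paper reports.

```python
import numpy as np


def mean_max_cosine_similarity(dec_a: np.ndarray, dec_b: np.ndarray) -> float:
    """For each feature in run A, take its best cosine match in run B, then average."""
    a = dec_a / np.linalg.norm(dec_a, axis=0, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=0, keepdims=True)
    cos = a.T @ b  # (n_features_a, n_features_b) cosine similarities
    return float(cos.max(axis=1).mean())


rng = np.random.default_rng(0)
dec_seed0 = rng.normal(size=(16, 32))
# A second run that learned the same directions in a different order and scale.
perm = rng.permutation(32)
dec_seed1 = 2.0 * dec_seed0[:, perm]
mcs_same = mean_max_cosine_similarity(dec_seed0, dec_seed1)
# An unrelated random dictionary for comparison.
mcs_random = mean_max_cosine_similarity(dec_seed0, rng.normal(size=(16, 32)))
```

The measure ignores feature order and scale, so two runs that recover the same directions score close to 1 regardless of how the features are indexed.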

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same inner-product constraint could be tested in other overcomplete dictionary learning settings beyond SAEs.
Monitoring alignment scores during training might serve as an early diagnostic for whether a run will produce many dead features.
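
As a rough illustration of the diagnostic idea above, the following sketch summarizes the alignment-score distribution of a standard (untied) SAE so it can be logged during training. It assumes NumPy, raw inner products without normalization, and a hypothetical `threshold` of 0.5 separating the two modes; none of these choices come from the paper.

```python
import numpy as np


def alignment_summary(w_enc: np.ndarray, w_dec: np.ndarray, threshold: float = 0.5) -> dict:
    """Summarize per-feature alignment scores for logging.

    threshold is a hypothetical cut-off between the low- and high-alignment modes.
    """
    scores = np.einsum("id,di->i", w_enc, w_dec)
    return {
        "mean": float(scores.mean()),
        "min": float(scores.min()),
        "low_fraction": float((scores < threshold).mean()),
    }


# Toy dictionary: three well-aligned features and one poorly aligned one.
w_dec = np.eye(4)
w_enc = np.diag([1.0, 0.9, 1.1, 0.1])
summary = alignment_summary(w_enc, w_dec)
```

A rising `low_fraction` early in training would be the kind of signal the diagnostic looks for, though whether it predicts dead features is exactly the open question.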
If the bimodality arises from gradient dynamics, similar geometric fixes might apply to related representation-learning methods.
Post-hoc feature pruning steps common in interpretability workflows could become less necessary.

Load-bearing premise

The assumption that the observed bimodal alignment distribution is a fixable degeneracy whose removal does not prevent the SAE from accurately representing the original activations.

What would settle it

Run aligned training and standard training on the same activation dataset; if the aligned version still produces a substantial fraction of dead features or shows worse reconstruction loss than the baseline, the claim that the constraint removes the root degeneracy would be falsified.
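
The comparison described above needs a concrete dead-feature measure. A minimal sketch, assuming NumPy and defining a feature as dead when it never activates on an evaluation batch (the paper may instead use a fixed token window or another criterion):

```python
import numpy as np


def dead_feature_fraction(feature_acts: np.ndarray) -> float:
    """Fraction of features that never fire on the evaluation batch.

    feature_acts: (n_tokens, n_features) non-negative SAE activations.
    """
    fired = (feature_acts > 0).any(axis=0)
    return float(1.0 - fired.mean())


# Toy activations for 4 features over 3 tokens; feature 2 never fires.
acts = np.array([
    [0.5, 0.0, 0.0, 1.2],
    [0.0, 0.3, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.7],
])
frac = dead_feature_fraction(acts)
```

Computing this quantity for the aligned and the standard SAE on the same held-out activations, alongside reconstruction loss, gives the two numbers the falsification test turns on.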

Figures

Figures reproduced from arXiv: 2605.18629 by Michał Brzozowski, Neo Christopher Chung.

**Figure 2.** Aligned training improves recovered cross-entropy across different sparsity levels. Dictio…

**Figure 3.** Aligned training improves TopK and BatchTopK autoencoders in the low-sparsity regime.

**Figure 4.** Aligned training reduces dead features to near zero without resampling or auxiliary losses.

**Figure 5.** The dead-feature reduction extends to TopK and BatchTopK. Dictionary size 65K, layer 12

**Figure 6.** Aligned training significantly improves cross-seed stability for both ReLU and TopK…

**Figure 7.** Reconstruction metrics for Pythia 160M (layer 8), dictionary size 4096, 3 random seeds.

**Figure 8.** Alive-feature fraction for Pythia 160M (layer 8), dictionary size 4096.

**Figure 9.** SCR metric from SAEBench, dictionary size 65K, Pythia 160M and Gemma 2 2B.

**Figure 10.** Bimodality of SAE alignment scores across different models and architectures.

**Figure 11.** MCS vs. alignment score (Pearson r = 0.65). The red vertical line marks $a_i = 1$.

Text captured with Figure 11, from Section C.3 (Alignment Scores Are Correlated with Autointerpretability): The alignment score is positively correlated with autointerpretability (Pearson r = 0.32; …

**Figure 12.** Autointerpretability vs. alignment score (Pearson…

**Figure 13.** Reconstruction metrics for Pythia 160M (layer 8) and Gemma 2 2B (layer 12), dictionary…

**Figure 14.** Weight tying reduces dead features but at the cost of reconstruction quality. Pythia 160M…

**Figure 15.** Reconstruction metrics, dictionary size 16384, Pythia 160M (layer 8) and Gemma 2 2B…

**Figure 16.** Alive-feature fraction, dictionary size 16384.

**Figure 17.** Reconstruction metrics, dictionary size 65K.

**Figure 18.** Alive-feature fraction, dictionary size 65K.

**Figure 19.** Reconstruction metrics at 500M tokens, dictionary size 65K, Gemma 2 2B (layer 12).

**Figure 20.** Alive-feature fraction at 500M tokens, dictionary size 65K, Gemma 2 2B (layer 12).

Original abstract

Sparse autoencoders (SAEs) are one of the main methods to interpret the inner workings of deep neural networks (DNNs), decomposing activations into higher-dimensional features. However, they exhibit critical shortcomings where a large fraction of features are never activated and are unstable. Despite variants of SAEs that attempt to mitigate these issues, they require additional data, resampling, or training. We propose the aligned training, a parameter-free reparameterization of SAEs that simultaneously improves reconstruction quality, eliminates dead features, and significantly enhances stability across training seeds. Our approach is motivated by an overlooked observation that SAE feature quality, measured by the inner product between encoder and decoder directions (which we call the alignment score), follows a bimodal distribution across all modern architectures. The proposed aligned training enforces a geometric constraint between the encoder and decoder such that their inner product equals one for every feature, which removes a source of degeneracy in the SAE training without adding any hyperparameters. Across multiple models, dictionary sizes, and sparsity levels, the aligned training shows Pareto improvements on the SAEBench benchmarks. Beyond improving dead features, stability and reconstruction, our method readily integrates with techniques in mechanistic interpretability such as Top/BatchTop-K architectures and p-Annealing. Overall, the aligned training substantially improves feature quality and stability of SAE without computational complexity or cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's main move is a parameter-free reparameterization that forces encoder-decoder inner product to exactly 1 per feature, motivated by a bimodal alignment distribution, and it reports cleaner features plus better stability on SAEBench.

The central claim is that SAE training has a hidden degeneracy visible in the bimodal spread of alignment scores between encoder and decoder directions. By reparameterizing so the decoder is tied directly to the encoder to enforce alignment of 1, the method removes that mode without introducing any new hyperparameters or extra data passes. The abstract says this yields Pareto gains on reconstruction, dead features, and seed-to-seed stability across models and sparsity settings, and that it slots in with Top-K and p-Annealing variants already in use.

That combination of simplicity and reported breadth is what makes the work worth a look. The reparameterization is genuinely new in the SAE literature they cite, and the geometric motivation from the observed bimodality is a clean observation that prior work had not acted on. If the full experiments back the abstract with consistent numbers across dictionary sizes, the practical payoff for people running SAEs is real: fewer wasted features and less need to rerun seeds. The approach also stays cheap, which matters when people are already scaling these things.

The main soft spot is that the reparameterization necessarily collapses some degrees of freedom and alters how gradients reach the weights. The paper pins the improvement on removing the low-alignment mode, yet without an explicit ablation that applies the same constraint through a penalty or projection while keeping the original untied form, it is hard to separate the geometric fix from changes in optimization dynamics. The abstract does not include error bars or full training curves, so the strength of the Pareto claim rests on whatever tables and figures appear in the body.

Readers already working with SAEs for mechanistic interpretability will get immediate value from testing the method on their own setups. It is a modest but concrete step rather than a wholesale redesign, and the claims are narrow enough to be checked quickly against existing benchmarks.

I would send this to peer review; the core idea is testable and the implementation cost is low enough that referees can focus on whether the gains are robust rather than on whether the method is worth trying at all.

Referee Report

2 major / 1 minor

Summary. The paper proposes aligned training, a parameter-free reparameterization of sparse autoencoders (SAEs) that enforces the inner product between encoder and decoder directions to equal exactly one for each feature. Motivated by an observed bimodal distribution of alignment scores, the method claims to remove a source of degeneracy in SAE training. It reports simultaneous improvements in reconstruction quality, elimination of dead features, and enhanced stability across training seeds, along with Pareto improvements on SAEBench benchmarks across models, dictionary sizes, and sparsity levels. The approach integrates with techniques such as Top-K and p-Annealing without added hyperparameters or computational cost.

Significance. If the central claim holds—that the hard geometric constraint directly fixes a degeneracy rather than merely altering optimization dynamics—this would represent a simple, hyperparameter-free improvement to a widely used tool in mechanistic interpretability. The reported compatibility with existing SAE variants and the absence of new hyperparameters are practical strengths that could facilitate adoption if the gains prove robust and mechanistically attributable to the alignment enforcement.

major comments (2)

[Method] Method section (reparameterization description): The aligned training ties the decoder direction to the encoder such that their inner product is fixed at 1, which necessarily reduces the number of independent parameters relative to the standard untied SAE formulation. The paper attributes observed gains to removal of the low-alignment mode in the bimodal distribution, yet no ablation is described that enforces the same alignment=1 constraint via a soft penalty or post-update projection while preserving the original untied parameterization. Without this comparison, it remains unclear whether improvements stem from the claimed geometric degeneracy fix or from changes in gradient flow and effective degrees of freedom.
[Experiments] Experiments and results sections: The central claim of Pareto improvements on SAEBench (reconstruction, dead features, stability) is load-bearing, but the manuscript does not report an explicit test of whether forcing alignment=1 compromises the SAE's ability to represent the underlying data distribution (e.g., via held-out reconstruction error or feature activation statistics under the constraint). The weakest assumption—that the bimodal distribution represents a fixable degeneracy rather than a natural outcome of optimization—requires direct empirical support through such a comparison.
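
The two ablation variants named in the first major comment can be sketched as follows, assuming NumPy and an untied SAE with encoder rows `w_enc[i]` and decoder columns `w_dec[:, i]`. Both functions are hypothetical illustrations of what a referee might ask for, not code from the paper; the "projection" here is a simple per-feature rescaling, used as a stand-in for a true projection.

```python
import numpy as np


def alignment_penalty(w_enc: np.ndarray, w_dec: np.ndarray) -> float:
    """Soft variant: mean squared deviation of alignment scores from 1.

    In training this would be added to the loss with a weight, which is
    the extra hyperparameter the soft approach introduces.
    """
    scores = np.einsum("id,di->i", w_enc, w_dec)
    return float(np.mean((scores - 1.0) ** 2))


def project_encoder(w_enc: np.ndarray, w_dec: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Post-update variant: rescale each encoder row so that its inner
    product with the matching decoder column is 1."""
    scores = np.einsum("id,di->i", w_enc, w_dec)
    # Assumption: clamp near-zero scores to avoid dividing by zero.
    safe = np.where(np.abs(scores) < eps, np.copysign(eps, scores), scores)
    return w_enc / safe[:, None]


rng = np.random.default_rng(1)
w_enc = rng.normal(size=(6, 3))
w_dec = rng.normal(size=(3, 6))
before = alignment_penalty(w_enc, w_dec)
after = alignment_penalty(project_encoder(w_enc, w_dec), w_dec)
```

Both variants keep the encoder and decoder as independent parameters, which is what would let an ablation separate the effect of the constraint itself from the effect of the reduced parameter count.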

minor comments (1)

[Abstract] The abstract states that the bimodal alignment distribution appears 'across all modern architectures' without listing the specific models, layers, or datasets used; adding this detail in the introduction or experimental setup would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, offering clarifications on the method and experiments while indicating revisions that will strengthen the manuscript.

Referee: [Method] Method section (reparameterization description): The aligned training ties the decoder direction to the encoder such that their inner product is fixed at 1, which necessarily reduces the number of independent parameters relative to the standard untied SAE formulation. The paper attributes observed gains to removal of the low-alignment mode in the bimodal distribution, yet no ablation is described that enforces the same alignment=1 constraint via a soft penalty or post-update projection while preserving the original untied parameterization. Without this comparison, it remains unclear whether improvements stem from the claimed geometric degeneracy fix or from changes in gradient flow and effective degrees of freedom.

Authors: We acknowledge that the reparameterization reduces the number of independent parameters by design, as this is the mechanism by which the unit inner product is strictly enforced. Our central claim is that this hard geometric constraint directly eliminates the low-alignment mode observed in the bimodal distribution, rather than merely altering optimization dynamics. A soft penalty or post-hoc projection would require an additional hyperparameter (e.g., penalty weight or projection frequency), which would violate the parameter-free property of the method. We will revise the method section to explicitly discuss the relationship between the hard constraint, parameter count, and the observed degeneracy, including a clearer justification for preferring the reparameterization over soft alternatives.

revision: partial

Referee: [Experiments] Experiments and results sections: The central claim of Pareto improvements on SAEBench (reconstruction, dead features, stability) is load-bearing, but the manuscript does not report an explicit test of whether forcing alignment=1 compromises the SAE's ability to represent the underlying data distribution (e.g., via held-out reconstruction error or feature activation statistics under the constraint). The weakest assumption, that the bimodal distribution represents a fixable degeneracy rather than a natural outcome of optimization, requires direct empirical support through such a comparison.

Authors: The reported Pareto improvements on SAEBench already include enhanced reconstruction quality across multiple settings, which is measured on data not used for training and thus provides indirect evidence that the constraint does not harm the ability to represent the data distribution. To directly address the concern, we will add an explicit comparison of held-out reconstruction error and feature activation statistics between aligned and standard SAEs in the revised experiments section. This addition will supply the requested empirical support for interpreting the bimodal distribution as a fixable degeneracy.

revision: yes

Circularity Check

0 steps flagged

No circularity: aligned training is a direct reparameterization with empirical validation

full rationale

The paper introduces aligned training as a parameter-free reparameterization that directly enforces the encoder-decoder inner product to equal 1 for each feature. This is motivated by an observed bimodal distribution of alignment scores but does not derive any result or prediction from fitted parameters or prior outputs. The claimed Pareto improvements on SAEBench are presented as empirical outcomes across models and settings, not as quantities that reduce to the constraint by construction. No self-citation chain, uniqueness theorem, or ansatz smuggling supports the central mechanism; the approach is self-contained as an engineering change to the SAE parameterization without load-bearing external citations or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method relies on the domain assumption about the alignment score distribution and the benefit of enforcing the constraint.

axioms (1)

domain assumption: SAE feature quality is measured by the inner product between encoder and decoder directions (the alignment score), and this score follows a bimodal distribution.
This observation motivates the method; the abstract describes it as overlooked and as holding across all modern architectures.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel / Jcost_unit0 echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

The proposed aligned training enforces a geometric constraint between the encoder and decoder such that their inner product equals one for every feature... $W^{\mathrm{enc}}_{i,\cdot} \cdot W^{\mathrm{dec}}_{\cdot,i} = 1$ for every feature $i$ by construction.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Toy model... perfect reconstruction forces the alignment score to one.
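
One toy setting in which the quoted statement can be checked by hand is a single feature with encoder direction $e$, nonzero decoder direction $d$, no biases, and an input $x = a\,d$ with $a > 0$. This setup is an assumption chosen for illustration; the paper's own toy model may differ.

```latex
\begin{aligned}
f(x) &= \mathrm{ReLU}(e^\top x) = a\,\mathrm{ReLU}(e^\top d), \\
\hat{x} &= f(x)\,d = a\,\mathrm{ReLU}(e^\top d)\,d, \\
\hat{x} = x \;&\Longleftrightarrow\; \mathrm{ReLU}(e^\top d) = 1 \;\Longleftrightarrow\; e^\top d = 1 .
\end{aligned}
```

Under these assumptions, any alignment score other than one rescales the reconstruction by a constant factor, so exact reconstruction pins the score to one.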

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] Benjamin Wright and Lee Sharkey. Addressing feature suppression in SAEs. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes

[2] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.

[3] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models, 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

[4] Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin. SAELens. https://github.com/jbloomAus/SAELens, 2024.

[5] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and ... Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023.

[6] Bart Bussmann, Patrick Leask, and Neel Nanda. BatchTopK sparse autoencoders. In NeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning, 2024. URL https://openreview.net/forum?id=d4dpOCqybL

[7] Tom Conerly, Adly Templeton, Trenton Bricken, Jonathan Marcus, and Tom Henighan. Update on dictionary learning improvements. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/april-update/index.html#training-saes

[8] Hoagy Cunningham. Autointerpretation finds sparse coding beats alternatives. AI Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/ursraZGcpfMjCXtnn/autointerpretation-finds-sparse-coding-beats-alternatives

[9] Hoagy Cunningham and Logan Riggs. [Replication] Conjecture’s sparse coding in small transformers. LessWrong, 2023. URL https://www.lesswrong.com/posts/vBcsAw4rvLsri3JAj/replication-conjecture-s-sparse-coding-in-small-transformers

[10] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/20...

[11] Kunihiko Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1(2):119–130, 1988. ISSN 0893-6080. doi: https://doi.org/10.1016/0893-6080(88)90014-7. URL https://www.sciencedirect.com/science/article/pii/0893608088900147

[12] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

[13] Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=tcsZt9ZNKD

[14] Robert Huben. [Research update] Sparse autoencoder features are bimodal. From AI to ZI, 2023. URL https://aizi.substack.com/p/research-update-sparse-autoencoder

[15] Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2023.

[16] Adam Jermyn and Adly Templeton. Ghost grads: An improvement on resampling. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/jan-update/index.html#dict-learning-resampling

[17] Adam Karvonen, Can Rager, Samuel Marks, and Neel Nanda. Evaluating sparse autoencoders on targeted concept erasure tasks, 2024. URL https://arxiv.org/abs/2411.18895

[18] Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Riggs Smith, Claudio Mayrink Verdun, David Bau, and Samuel Marks. Measuring progress in dictionary learning for language model interpretability with board game models. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum?id=qzsDKwGJyB

[19] Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, and Neel Nanda. SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability, 2025. URL https://arxiv.org/abs/2503.09532

[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791

[21] Luke Marks, Alasdair Paren, David Krueger, and Fazl Barez. Enhancing neural network interpretability with feature-aligned sparse autoencoders, 2024. URL https://arxiv.org/abs/2411.01220

[22] Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=I4e82CIDxv

[23] Gonçalo Paulo and Nora Belrose. Sparse autoencoders trained on the same data learn different features. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=EjInprGpk9

[24] Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving sparse decomposition of language model activations with gated sparse autoencoders. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Sy...

[25] Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, Janos Kramar, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders, 2025. URL https://openreview.net/forum?id=mMPaQzgzAN

[26] Logan Riggs. (tentatively) found 600+ monosemantic features in a small LM using sparse autoencoders. AI Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/wqRqb7h6ZC48iDgfK/tentatively-found-600-monosemantic-features-in-a-small-lm

[27] Alex Rogozhnikov. Einops: Clear and reliable tensor manipulations with Einstein-like notation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=oapKSVM2bcj

[28] Samuel Marks, Adam Karvonen, and Aaron Mueller. dictionary_learning, 2024. URL https://github.com/saprmarks/dictionary_learning

[29] Lee Sharkey, Dan Braun, and Beren Millidge. Taking features out of superposition with sparse autoencoders. Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition

[30] Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, and Kun Zhang. Position: Mechanistic interpretability should prioritize feature consistency in SAEs. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025. URL https://openreview.net/forum?id=d9ACURK6bI

[31] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, ...

[32] Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=K2CckZjNy0

A Implementation Details

All SAEs ...