Toward Identifiable Sparse Autoencoders
Pith reviewed 2026-06-28 22:59 UTC · model grok-4.3
The pith
Minimal changes to TopK sparse autoencoders yield stable and near-identifiable models
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By introducing minimal changes to the standard TopK SAE architecture and training procedure, the authors create two versions of an identifiable SAE (iSAE) that achieve lower reconstruction error and improved stability across training runs. They connect SAEs to traditional dictionary learning and demonstrate that the learned dictionaries satisfy an approximate restricted isometry condition, which renders the sparse codes near-identifiable.
What carries the argument
The iSAE variant of TopK SAE, whose learned dictionary satisfies an approximate restricted isometry condition to ensure near-identifiability of sparse codes
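For orientation, here is a minimal sketch of the standard TopK SAE forward pass that the iSAE modifies. It is not the authors' iSAE: the names `topk_sae_forward`, `W_enc`, `b_enc`, and `D`, the rows-as-atoms layout, and the choice to keep raw pre-activations (no ReLU on the kept entries) are all assumptions made for illustration.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, D, k):
    """Hypothetical TopK SAE forward pass (illustration only).

    x:     (n, d) batch of activations
    W_enc: (d, m) encoder weights
    b_enc: (m,)   encoder bias
    D:     (m, d) dictionary, one atom per row
    Returns codes (n, m) with at most k nonzeros per row, and recon (n, d).
    """
    pre = x @ W_enc + b_enc
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    codes = np.zeros_like(pre)
    codes[rows, idx] = pre[rows, idx]  # keep the top-k values, zero the rest
    recon = codes @ D                  # linear decode through the dictionary
    return codes, recon

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
codes, recon = topk_sae_forward(
    x, rng.standard_normal((8, 32)), np.zeros(32), rng.standard_normal((32, 8)), k=4
)
```

The decoder is linear in the codes, which is what makes the dictionary-learning view applicable: each reconstruction is a combination of at most `k` atoms.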
If this is right
- iSAEs exhibit improved stability, producing consistent dictionaries and codes across different training runs
- The modifications result in lower reconstruction error compared to standard TopK SAEs
- Sparse codes in iSAEs are near-identifiable due to the dictionary properties
- The approach links sparse autoencoders to classical dictionary learning for theoretical analysis
Where Pith is reading between the lines
- If iSAEs become standard, mechanistic interpretability studies could rely on more reproducible feature dictionaries
- Similar stability improvements might be applicable to other sparse coding methods in machine learning
- This could enable more reliable scaling of interpretability techniques to larger models
Load-bearing premise
Dictionaries learned by the modified SAEs in practice satisfy an approximate restricted isometry condition
What would settle it
Running multiple independent trainings of the iSAE and checking whether the resulting dictionaries and sparse codes are highly similar or identical; alternatively, verifying whether the learned dictionary matrix satisfies the approximate restricted isometry property
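The cross-run check can be sketched as follows. This is a simple similarity score, not the stability metric used in the paper: the function name `mean_max_cosine`, the rows-as-atoms layout, and the use of absolute cosine (which treats sign-flipped atoms as matching) are assumptions.

```python
import numpy as np

def mean_max_cosine(D_a, D_b):
    """For each atom of D_a, find its best |cosine| match in D_b; average the result.

    D_a, D_b: (m, d) dictionaries from two training runs, one atom per row.
    A value of 1.0 means every atom of D_a has an exact match (up to sign and
    scale) in D_b, so the score is invariant to atom permutation.
    """
    A = D_a / np.linalg.norm(D_a, axis=1, keepdims=True)
    B = D_b / np.linalg.norm(D_b, axis=1, keepdims=True)
    sim = np.abs(A @ B.T)  # (m, m) pairwise absolute cosine similarities
    return float(sim.max(axis=1).mean())
```

Because the score takes a maximum per atom, several atoms of `D_a` may match the same atom of `D_b`; a one-to-one assignment would be stricter.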
Original abstract
Recently, sparse autoencoders (SAEs) have emerged as an attractive tool for interpreting and interacting with representations in practical neural networks. While it is common empirical folklore, we also show theoretically that SAEs are highly unstable: different training runs are likely to produce different concept dictionaries and sparse codes. We characterize the model properties that hinder the stability of real-world SAEs, and address each of these problems through minimal changes to the architecture and training procedure. Together, these changes yield two versions of an identifiable SAE (iSAE), a variant of the standard TopK SAE with lower reconstruction error and improved stability. We explain this improvement theoretically by connecting SAEs with traditional dictionary learning approaches, and show that the dictionaries learned in practice satisfy an approximate restricted isometry condition, rendering the corresponding sparse codes in those models near-identifiable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard sparse autoencoders (SAEs) are highly unstable across training runs, theoretically characterizes the model properties responsible, and proposes minimal changes to architecture and training that produce two variants of an identifiable SAE (iSAE) with lower reconstruction error and improved stability. It connects SAEs to classical dictionary learning, asserts that the learned dictionaries satisfy an approximate restricted isometry condition (RIC), and concludes that the resulting sparse codes are therefore near-identifiable.
Significance. If the RIC claim is placed on a quantitative footing that ties the observed constant to sparsity level k and recovery error, the work would supply a concrete theoretical explanation for SAE instability and a practical route to more stable, interpretable dictionaries; the explicit linkage to dictionary-learning recovery guarantees is a strength that could influence how future SAE training objectives are designed.
major comments (2)
- [Empirical verification of the RIC (section discussing dictionary properties and identifiability)] The assertion that learned dictionaries satisfy an approximate restricted isometry condition (invoked to conclude near-identifiability of the sparse codes) is load-bearing for the central theoretical claim, yet the manuscript reports only that the condition “holds approximately” without measured values of δ_{2k} or a demonstration that δ_{2k} is small enough relative to the observed sparsity k to satisfy standard dictionary-learning recovery bounds (e.g., δ_{2k} < 1/3 for basis pursuit).
- [Theoretical characterization of instability] The theoretical characterization of SAE instability is stated in the abstract and used to motivate the architectural changes, but the manuscript provides no explicit derivation or equation showing how the identified model properties (e.g., non-identifiability of the dictionary or lack of RIP) produce the observed run-to-run variability in concept dictionaries.
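The measurement the first comment asks for can be approximated by sampling. A minimal sketch, assuming a dictionary stored with one atom per row and a hypothetical helper name `rip_lower_bound`: it samples random supports of size `s` and records the worst deviation of the sub-dictionary's Gram eigenvalues from 1. Computing the constant exactly would require checking every support, which is combinatorial, so the result is only a lower bound on δ_s. It can show that a threshold such as δ_{2k} < 1/3 is violated, but it cannot certify that the threshold holds.

```python
import numpy as np

def rip_lower_bound(D, s, n_trials=1000, seed=0):
    """Monte Carlo lower bound on the restricted isometry constant delta_s.

    D: (m, d) dictionary, one atom per row (atoms are normalized internally).
    s: support size; use s = 2 * k to probe delta_{2k}.
    """
    rng = np.random.default_rng(seed)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    m = D.shape[0]
    worst = 0.0
    for _ in range(n_trials):
        S = rng.choice(m, size=s, replace=False)  # random support of s atoms
        gram = D[S] @ D[S].T                      # (s, s) Gram matrix
        eig = np.linalg.eigvalsh(gram)            # eigenvalues, ascending
        worst = max(worst, eig[-1] - 1.0, 1.0 - eig[0])
    return worst
```

Random supports rarely hit the worst case, so a small sampled value is weak evidence; targeted searches over highly correlated atoms would give a tighter lower bound.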
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights opportunities to strengthen the empirical and theoretical foundations of our claims regarding identifiable sparse autoencoders. We address each major comment below and will incorporate revisions accordingly.
- Referee: [Empirical verification of the RIC (section discussing dictionary properties and identifiability)] The assertion that learned dictionaries satisfy an approximate restricted isometry condition (invoked to conclude near-identifiability of the sparse codes) is load-bearing for the central theoretical claim, yet the manuscript reports only that the condition “holds approximately” without measured values of δ_{2k} or a demonstration that δ_{2k} is small enough relative to the observed sparsity k to satisfy standard dictionary-learning recovery bounds (e.g., δ_{2k} < 1/3 for basis pursuit).
  Authors: We agree that quantitative verification of the RIC is necessary to make the identifiability claim rigorous. In the revised manuscript we will add explicit computations of δ_{2k} on the learned dictionaries from multiple runs, report the observed values as a function of k, and verify that they fall below standard recovery thresholds (e.g., δ_{2k} < 1/3) sufficient for basis pursuit guarantees. This will directly link the constant to sparsity level and reconstruction error.
  Revision: yes
- Referee: [Theoretical characterization of instability] The theoretical characterization of SAE instability is stated in the abstract and used to motivate the architectural changes, but the manuscript provides no explicit derivation or equation showing how the identified model properties (e.g., non-identifiability of the dictionary or lack of RIP) produce the observed run-to-run variability in concept dictionaries.
  Authors: The manuscript motivates the instability claim via the connection to non-unique dictionary recovery in the absence of RIP, but we acknowledge that an explicit step-by-step derivation linking these properties to run-to-run variability is not presented with dedicated equations. We will add a short subsection in the revision that derives the multiplicity of consistent dictionaries under violated RIP and shows how this induces the observed variability across random initializations.
  Revision: yes
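For readers who want the link between the restricted isometry property and code uniqueness spelled out, the following is the standard compressed-sensing argument for a fixed dictionary. It is not the derivation the authors promise, and it does not address variability of the dictionary itself across runs. The notation (dictionary D with atoms as columns, codes z, constant δ_s) is assumed here for illustration.

```latex
% RIP of order s with constant \delta_s: for every s-sparse z,
(1 - \delta_s)\,\|z\|_2^2 \;\le\; \|D z\|_2^2 \;\le\; (1 + \delta_s)\,\|z\|_2^2 .

% Uniqueness of k-sparse codes when \delta_{2k} < 1:
% let z_1, z_2 be k-sparse with D z_1 = D z_2. Then h = z_1 - z_2 is 2k-sparse and
0 \;=\; \|D h\|_2^2 \;\ge\; (1 - \delta_{2k})\,\|h\|_2^2
\quad\Longrightarrow\quad h = 0, \ \text{i.e. } z_1 = z_2 .

% Converse: if some set of 2k atoms is linearly dependent, there is a nonzero
% 2k-sparse h with D h = 0. Splitting its support into two halves gives
% h = z_1 - z_2 with z_1 \ne z_2 both k-sparse and D z_1 = D z_2,
% so the same input admits two distinct k-sparse codes.
```

The smaller δ_{2k} is, the more stable the recovered code is under noise, which is why the referee asks for a measured value rather than a qualitative statement.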
Circularity Check
No significant circularity; derivation relies on external dictionary learning connections and empirical RIC checks
The paper's central chain derives SAE instability theoretically from model properties, introduces minimal architectural and training changes to produce iSAE variants, then invokes standard results from traditional dictionary learning to explain improved stability. It reports that the learned dictionaries satisfy an approximate restricted isometry condition via direct inspection of the matrices obtained in practice, rather than by redefining identifiability in terms of the fitted parameters or renaming a fitted quantity as a prediction. No self-citation is shown to be load-bearing for the identifiability claim, and no step reduces the conclusion to a tautology or ansatz smuggled through prior author work. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: dictionaries learned by the modified SAEs satisfy an approximate restricted isometry condition