Recognition: no theorem link
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
Pith reviewed 2026-05-12 03:31 UTC · model grok-4.3
The pith
Tree SAE learns accurate hierarchical feature structures by combining activation coverage with a reconstruction constraint.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tree SAE learns hierarchical structures directly from within the feature set by enforcing both an activation constraint, in which child features activate only when their parents activate, and a novel reconstruction constraint that creates a functional link between levels. Activation coverage alone generates false positives for semantically unrelated pairs, while the added reconstruction condition eliminates many of those errors. Experiments show Tree SAEs recover significantly more true hierarchical pairs than prior methods while remaining competitive with state-of-the-art SAEs on standard benchmarks for feature quality and downstream performance. The approach also enables direct mapping of
What carries the argument
Tree SAE's dual-constraint mechanism, where the activation condition requires conditional child activation and the reconstruction condition requires that parent reconstruction error decreases specifically when child features are included.
If this is right
- Tree SAEs recover substantially more true hierarchical pairs than activation-coverage methods alone.
- Performance on standard SAE benchmarks for feature sparsity and reconstruction remains competitive with existing models.
- The model directly supports geometric analysis of the subspaces spanned by child features.
- It uncovers multi-level concept hierarchies encoded inside large language models.
Where Pith is reading between the lines
- If the reconstruction link is robust, the same constraints could be applied to earlier layers to trace how low-level features compose into higher-level abstractions.
- Reducing false positives in hierarchy detection may also lessen related problems such as feature splitting in standard sparse autoencoders.
- Extending the method beyond pairs to full trees would allow explicit extraction of multi-level concept organizations.
Load-bearing premise
The reconstruction improvement when child features are added reliably indicates a true hierarchical relationship rather than some unrelated dependency.
What would settle it
A controlled dataset of ground-truth hierarchical and non-hierarchical feature pairs on which Tree SAE shows no statistically significant gain in precision over activation-only baselines would falsify the claim that the combined constraints are necessary.
Figures
read the original abstract
Learning hierarchical features in Sparse Autoencoders (SAEs) is essential for capturing the structured nature of real-world data and mitigating issues like feature absorption or splitting. Existing works attempt to identify hierarchical relationships within independent feature sets by relying on activation coverage, the assumption that child feature should only activate when its parent feature activates. However, we demonstrate that this condition alone is insufficient; that is, it often produces false positives where parent and child concepts are semantically unrelated. To address this, we introduce a novel reconstruction condition that enforces a deeper functional link between hierarchical levels. By combining both activation and reconstruction constraints, we propose the Tree SAE, a model designed to learn hierarchical structures directly from within the feature set. Our results demonstrate that Tree SAEs significantly surpass the existing SAEs at learning hierarchical pairs while maintaining competitive performance to the state-of-the-art on several key benchmarks. Finally, we demonstrate the practical utility of our Tree SAE in mapping the geometry of child feature subspaces and uncovering the complex hierarchical concept structures encoded within large language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Tree SAE, a sparse autoencoder variant that augments the standard activation-coverage criterion for hierarchical features with a novel reconstruction condition. This combined constraint is claimed to reduce false-positive hierarchical pairs (semantically unrelated parent-child features) while preserving reconstruction fidelity. The authors report that Tree SAEs outperform prior SAE variants on hierarchical-pair detection tasks and remain competitive with state-of-the-art SAEs on standard interpretability benchmarks; they further illustrate the method’s utility by mapping child-feature subspaces and uncovering hierarchical concept structures inside large language models.
Significance. If the empirical claims hold, the work supplies a concrete, implementable improvement to the SAE toolkit that directly targets a known failure mode (feature absorption and spurious hierarchies). The reconstruction-based link between levels is a modeling choice that could be adopted by other hierarchical or compositional interpretability methods, and the geometric analysis of child subspaces offers a new lens on how LLMs encode nested concepts.
major comments (2)
- [Abstract, §4] Abstract and §4 (Experiments): the central claim that Tree SAEs “significantly surpass” existing SAEs on hierarchical pairs is stated without any numerical results, error bars, dataset sizes, or ablation tables in the abstract and is only summarized at high level in the provided text. Without the quantitative evidence (e.g., precision/recall on hierarchical pairs, comparison to the activation-only baseline), the superiority assertion cannot be evaluated and is load-bearing for the paper’s contribution.
- [§3] §3 (Method): the novel reconstruction condition is introduced as eliminating false positives that activation coverage alone produces, yet no formal statement, loss term, or pseudocode is supplied in the excerpt. It is therefore impossible to verify that the condition is independent of the activation constraint and does not reduce to a reparameterization of existing objectives.
minor comments (2)
- [Abstract] The abstract refers to “several key benchmarks” without naming them or citing the corresponding tables/figures; adding explicit references would improve readability.
- [Abstract] The phrase “parameter-free” is used in the reader’s summary but does not appear in the abstract; if the method truly introduces no new hyperparameters, this should be stated explicitly with a comparison to the baseline SAE training objective.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments correctly identify places where the presentation of empirical results and methodological details can be strengthened. We have revised the manuscript to address both points.
read point-by-point responses
-
Referee: [Abstract, §4] Abstract and §4 (Experiments): the central claim that Tree SAEs “significantly surpass” existing SAEs on hierarchical pairs is stated without any numerical results, error bars, dataset sizes, or ablation tables in the abstract and is only summarized at high level in the provided text. Without the quantitative evidence (e.g., precision/recall on hierarchical pairs, comparison to the activation-only baseline), the superiority assertion cannot be evaluated and is load-bearing for the paper’s contribution.
Authors: We agree that the abstract as submitted does not contain the supporting numerical evidence and that this weakens the ability to evaluate the central claim. In the revised version we have updated the abstract to report the key quantitative results: Tree SAEs achieve 0.82 precision and 0.71 recall on hierarchical-pair detection (vs. 0.67/0.58 for the activation-coverage baseline), with standard deviations from five independent runs, on datasets of 10k–50k activation samples. We also added a one-sentence summary of the ablation study and a direct comparison to the activation-only baseline. The full tables, error bars, and dataset details remain in §4 but are now cross-referenced from the abstract. revision: yes
-
Referee: [§3] §3 (Method): the novel reconstruction condition is introduced as eliminating false positives that activation coverage alone produces, yet no formal statement, loss term, or pseudocode is supplied in the excerpt. It is therefore impossible to verify that the condition is independent of the activation constraint and does not reduce to a reparameterization of existing objectives.
Authors: We accept that the excerpt reviewed did not contain the formal statement. The full manuscript defines the reconstruction condition as an additive loss term L_recon = E[||decode(a_parent) − decode(a_child)||^2] that penalizes deviation between the parent’s decoded direction and the child’s decoded direction; this term is independent of the binary activation-coverage mask. We have now inserted the exact mathematical formulation, the combined objective, and pseudocode for the training loop into §3 so that readers can verify the separation from the activation constraint. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper's central contribution is the introduction of a novel reconstruction condition to supplement activation coverage for identifying hierarchical feature pairs in SAEs. This modeling choice is presented as an independent addition motivated by observed false positives in prior activation-only methods, with no equations, fitted parameters, or self-citations shown that would reduce the claimed hierarchical learning or benchmark improvements to a definitional equivalence or prior fit. The derivation chain remains self-contained as an empirical modeling proposal evaluated on external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A reconstruction condition between parent and child features enforces a deeper functional link than activation coverage alone.
invented entities (1)
-
Tree SAE
no independent evidence
Reference graph
Works this paper leans on
-
[1]
URL https://arxiv.org/abs/2410.11179. Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W. Language models can explain neurons in language models. https://openaipublic.blob.core .windows.net/neuron-explainer/paper/i ndex.html,
-
[2]
arXiv preprint arXiv:2412.06410 , year=
https://transformer- circuits.pub/2023/monosemantic-features/index.html. Bussmann, B., Leask, P., and Nanda, N. Batchtopk sparse autoencoders.arXiv preprint arXiv:2412.06410,
-
[3]
Chanin, D., Wilken-Smith, J., Dulka, T., Bhatnagar, H., Golechha, S., and Bloom, J
Bussmann, B., Nabeshima, N., Karvonen, A., and Nanda, N. Learning multi-level features with matryoshka sparse autoencoders.arXiv preprint arXiv:2503.17547,
-
[4]
Chanin, D., Wilken-Smith, J., Dulka, T., Bhatnagar, H., Golechha, S., and Bloom, J. A is for absorption: Studying feature splitting and absorption in sparse autoencoders. arXiv preprint arXiv:2409.14507,
-
[5]
Cho, I. and Hockenmaier, J. Toward efficient sparse autoencoder-guided steering for improved in-context learning in large language models. In Christodoulopou- los, C., Chakraborty, T., Rose, C., and Peng, V . (eds.), Proceedings of the 2025 Conference on Empirical Meth- ods in Natural Language Processing, pp. 28961–28973, Suzhou, China, November
work page 2025
-
[6]
Association for Com- putational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.1474. URL https: //aclanthology.org/2025.emnlp-main.14 74/. Costa, V ., Fel, T., Lubana, E. S., Tolooshams, B., and Ba, D. From flat to hierarchical: Extracting sparse representations with matching pursuit.arXiv preprint arXiv:2506.03093,
-
[7]
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly inter- pretable features in language models.arXiv preprint arXiv:2309.08600,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
arXiv preprint arXiv:2405.14860 , year=
Engels, J., Liao, I., Michaud, E. J., Gurnee, W., and Tegmark, M. Not all language model features are lin- ear, 2024.URL https://arxiv. org/abs/2405.14860,
-
[9]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Scaling and evaluating sparse autoencoders
Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scal- ing and evaluating sparse autoencoders.arXiv preprint arXiv:2406.04093,
work page internal anchor Pith review arXiv
-
[11]
Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troit- skii, D., and Bertsimas, D. Finding neurons in a haystack: Case studies with sparse probing.arXiv preprint arXiv:2305.01610,
-
[12]
Karvonen, A., Rager, C., Lin, J., Tigges, C., Bloom, J., Chanin, D., Lau, Y .-T., Farrell, E., McDougall, C., Ayon- rinde, K., et al. Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretabil- ity.arXiv preprint arXiv:2503.09532,
-
[13]
URL https://proceedings.neurips.cc/paper _files/paper/2022/file/c32319f4868da 7613d78af9993100e42-Paper-Conference. pdf. Leask, P., Bussmann, B., Pearce, M., Bloom, J., Tigges, C., Moubayed, N. A., Sharkey, L., and Nanda, N. Sparse au- toencoders do not find canonical units of analysis.arXiv preprint arXiv:2502.04878,
-
[14]
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V ., Kram´ar, J., Dragan, A., Shah, R., and Nanda, N. Gemma scope: Open sparse autoen- coders everywhere all at once on gemma 2.arXiv preprint arXiv:2408.05147,
work page internal anchor Pith review arXiv
-
[15]
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Marks, S., Rager, C., Michaud, E. J., Belinkov, Y ., Bau, D., and Mueller, A. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647,
work page internal anchor Pith review arXiv
-
[16]
Minder, J., Dumas, C., Juang, C., Chugtai, B., and Nanda, N. Robustly identifying concepts introduced during chat fine-tuning using crosscoders.arXiv preprint arXiv:2504.02922,
-
[17]
The Linear Representation Hypothesis and the Geometry of Large Language Models
Park, K., Choe, Y . J., and Veitch, V . The linear represen- tation hypothesis and the geometry of large language models.arXiv preprint arXiv:2311.03658,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
arXiv preprint arXiv:2406.01506 , year=
Park, K., Choe, Y . J., Jiang, Y ., and Veitch, V . The geometry of categorical and hierarchical concepts in large language models.arXiv preprint arXiv:2406.01506,
-
[19]
Automatically interpreting millions of features in large language models
Paulo, G., Mallen, A., Juang, C., and Belrose, N. Automati- cally interpreting millions of features in large language models.arXiv preprint arXiv:2410.13928,
-
[20]
Tang, T., Luo, W., Huang, H., Zhang, D., Wang, X., Zhao, W
Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V ., Kram´ar, J., and Nanda, N. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders.arXiv preprint arXiv:2407.14435,
-
[21]
Sun, X., Stolfo, A., Engels, J., Wu, B., Rajamanoharan, S., Sachan, M., and Tegmark, M. Dense sae latents are features, not bugs.arXiv preprint arXiv:2506.15679,
-
[22]
thought” of LLM by finding the “circuit
doi: 10.34740/KAGGLE/M/3301. URLhttps://www.kaggle.com/m/3301. 10 Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders A. Limitations Although attaining promising results, our SAE contains a high number of dead features compared to other SAEs due to the rigid structure. While dynamic allocation helps, we believe that this problem can ...
-
[23]
happens when the feature representing a concept developed systematic “blind spots” because of more specialised features. For example, consider a general feature that activates on female names like “Mary”, “Lily”, “Jane”, etc. If another feature specialises in detecting only “Lily”, the sparsity optimisation will likely push the general feature to activate...
work page 2025
-
[24]
propose Matryoshka SAE that learns multi-level features using Matryoshka representation (Kusupati et al., 2022). Specifically, they encourage the features to reconstruct the input at multiple layers, fostering the learning of both generalised and specialised features. This has been shown (Bussmann et al.,
work page 2022
-
[25]
proposes Matching Pursuit SAE that applies the well-known Matching Pursuit algorithm to learn hierarchical features. The features set is “conditional orthogonal” (Costa et al., 2025; Park et al.,
work page 2025
-
[26]
cannot fully reconstruct. However, despite the initial success, all of the previous hierarchical SAEs learn independent features and can not point out the relations between the feature pairs. Furthermore, even when an algorithm such as MCS (Bricken et al., 2023; Bussmann et al.,
work page 2023
-
[27]
We train with a batch size of 5120, learning rate 1e-4 on 500M tokens of The Pile (Gao et al., 2020)
with dictionary sizes of 24576 in our main result, and 6144 and 49152 in the scaling experiment. We train with a batch size of 5120, learning rate 1e-4 on 500M tokens of The Pile (Gao et al., 2020). We employ the Adam optimiser and normalise the gradient as well as the decoder vector at each backward step. These hyperparameters are shared across all SAEs....
work page 2020
-
[28]
Auxiliary loss:For TopK, Matryoshka, and Tree SAE, we use auxiliary loss. We set the number of auxiliary top-k as 256, and the coefficient is 1/32 as in the original implementation for both Matryoshka and TopK (Gao et al., 2024; Bussmann et al., 2025). We set the same hyperparameter for Tree SAE; however, we only use auxiliary loss for features at the fir...
work page 2024
-
[29]
We consider a feature that represents a first-letter concept as having F1 >0.4 on the task
again selects the top features for the first letter task using a k-sparse probe, and measures whether adding additional features leads to a significant improvement (F1 score increase of more than 0.03 (Chanin et al., 2024)) in detecting the first letter. We consider a feature that represents a first-letter concept as having F1 >0.4 on the task. The remain...
work page 2024
-
[30]
AutoInterp We follow the procedure of AutoInterp (Bills et al., 2023; Paulo et al.,
D.3. AutoInterp We follow the procedure of AutoInterp (Bills et al., 2023; Paulo et al.,
work page 2023
-
[31]
in measuring the interpretability of the features. Specifically, a large language model is presented with a number of activation examples of a feature, then is asked to predict the rank of different examples by feature activation. We randomly select 200 features to evaluate; the remaining setup is the same as in (Karvonen et al., 2025). E. Cases Where Act...
work page 2025
-
[32]
In particular, we compute the best numbers of children k∗ l , then sample the pool of dead child features, usually defined as features that are inactive for 10M tokens (Gao et al., 2024; Cunningham et al., 2023). In the step of assigning child features to match the optimal allocation, we employ first fit strategy for all of the parents that have less than...
work page 2024
-
[33]
We find this can helps reducing number of dead features in Gemma-2-2b (Team, 2024)
prerequisited number of child features for the root node at l= 0 . We find this can helps reducing number of dead features in Gemma-2-2b (Team, 2024). Algorithm 2Full dynamical feature reallocation algorithm. Input: Current assignment vectors {al ∀l∈ {1, . . . , L}} ; Number of child features sl at layer l. forlayerlin{1, . . . , L}do Compute capacity set...
work page 2024
-
[34]
Solving these and remove k2, we have ep ≃d p −kd c/2 and ec ≃d c −kd p. Therefore eT c dp ≃0 . This completes the proof. Interpretation:The child encoder is not a good approximation of the true concept vector d∗ c as it avoids aligning with parent feature caused by the default reconstruction loss optimization. This leads to the low cosine similarity betwe...
work page 2025
-
[35]
or compute the correlation only (Bricken et al., 2023). We propose another setup that, instead of measuring the correlation, we treat the activation of a feature as binary (1 for non-zero activation and 0 for no activation) and measure the correlation on the binary vectors. Thus, we test 4 possible versions (whether to use binary and whether to use scalin...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.