Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features
Pith reviewed 2026-05-14 20:24 UTC · model grok-4.3
The pith
Many distinct SAE features receive identical natural-language explanations, so existing auto-interpretability scores overstate how uniquely each feature is identified.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Descriptive collision occurs when many distinct SAE features admit the same natural-language explanation. In the Marks et al. dataset of 722 annotated features from Gemma 2 2B and Pythia 70M, the mean annotation is reused across 3.07 features, 82.1 percent of features share their annotation with at least one other, and the most common string (plural nouns) labels 101 features across 18 layers and four model components. The average annotation resolves only 70 percent of feature identity. Current detection scoring is invariant to this reuse, and the collision problem is independent of and additive with polysemanticity.
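The headline collision statistics (mean reuse, fraction of features with a shared annotation, most common label) can be recomputed from any annotation table. A minimal sketch on toy data — the strings and counts here are hypothetical stand-ins, not the Marks et al. dataset:

```python
from collections import Counter

# Toy stand-in for an annotation table: one explanation string per SAE feature.
annotations = [
    "plural nouns", "plural nouns", "plural nouns",
    "German text", "newline tokens", "German text",
    "opening quotation marks",
]

counts = Counter(annotations)

# Mean number of features carrying each feature's annotation string.
mean_reuse = sum(counts[a] for a in annotations) / len(annotations)

# Fraction of features whose annotation labels at least one other feature.
shared_frac = sum(1 for a in annotations if counts[a] > 1) / len(annotations)

# Most common annotation string and how many features it labels.
top_label, top_count = counts.most_common(1)[0]

print(mean_reuse, shared_frac, top_label, top_count)
```

On the real 722-feature table the same three quantities come out to 3.07, 82.1%, and "plural nouns" at 101 features.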
What carries the argument
Descriptive collision, the reuse of one explanation string across multiple distinct SAE features, together with the discrimination metric that quantifies how completely an explanation isolates its target feature from all others.
If this is right
- Current auto-interpretability scores overstate true feature identifiability by an amount equal to roughly one-third of the bits required to name a feature.
- Detection-only metrics remain unchanged even when every explanation is shared by many features, so they cannot detect collision.
- Collision-adjusted detection and discrimination scoring must be used together with existing metrics to avoid additive overestimation.
- The problem appears across layers and model components and is therefore not limited to any single SAE training regime.
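The paper's exact collision-adjusted detection formula is not reproduced in this review; one illustrative adjustment in its spirit splits each feature's detection score across the collision class of its explanation, so credit carried by a reused string is shared rather than double-counted (function and data hypothetical):

```python
from collections import Counter

def collision_adjusted(detection_scores, explanations):
    """Illustrative adjustment (not the paper's exact formula): scale each
    feature's detection score by 1 / (number of features sharing its
    explanation string)."""
    counts = Counter(explanations.values())
    return {f: s / counts[explanations[f]] for f, s in detection_scores.items()}

scores = {"f1": 0.9, "f2": 0.9, "f3": 0.8}
expls = {"f1": "plural nouns", "f2": "plural nouns", "f3": "German text"}

adj = collision_adjusted(scores, expls)
print(adj)  # → {'f1': 0.45, 'f2': 0.45, 'f3': 0.8}
```

Here f1 and f2 split the credit for the shared "plural nouns" string, while the uniquely labeled f3 keeps its full score — the qualitative behavior the corrective metrics are meant to have.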
Where Pith is reading between the lines
- Auto-interpretability pipelines may need to generate longer or more contrastive explanations that explicitly rule out neighboring features.
- If collision persists under automated explanation methods, it would indicate that the underlying SAE features themselves are not cleanly separable by short natural-language descriptions.
- Downstream uses that treat each explanation as a reliable pointer to a single feature, such as circuit editing or safety auditing, inherit the same ambiguity.
Load-bearing premise
That the existing human annotations accurately capture each feature's semantics and that unique discrimination via explanations is required for useful interpretability.
What would settle it
A new annotation pass on the same 722 features that produced unique strings for every feature and raised the recovered identity information above 95 percent would falsify the reported prevalence and severity of collision.
original abstract
Sparse autoencoders (SAEs) are now standard tools for decomposing language model activations into interpretable features, and automated interpretability pipelines routinely assign each feature a short natural-language explanation. Existing critiques of this practice focus on polysemanticity -- one feature with many meanings -- or on whether explanations predict activations. We identify a complementary, structurally distinct problem we call descriptive collision: many distinct SAE features admit the same explanation. Reanalyzing the largest publicly-available dataset of human-annotated SAE features (Marks et al., 2025), comprising 722 annotated features across Gemma 2 2B and Pythia 70M, we find that the mean annotation string is reused across 3.07 features; 82.1% of features share their annotation with at least one other feature; and the single most common annotation string ("plural nouns") labels 101 distinct features spanning 18 layers and four model components. Information-theoretically, the average annotation resolves only 70% of feature identity. We formalize a property called discrimination, prove that current detection-style auto-interpretability scoring is invariant to collision, and propose two complementary corrective metrics -- collision-adjusted detection and discrimination scoring -- that explicitly penalize explanations that fail to distinguish a feature from its neighbors. The collision problem is independent of, and additive with, previously identified failure modes of auto-interpretability; ignoring it inflates reported feature interpretability by a quantity equal to roughly one-third of the bits required to identify a feature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that descriptive collision—where distinct SAE features share the same natural language explanation—is a prevalent and previously underappreciated issue in auto-interpretability. By reanalyzing 722 human-annotated features from the Marks et al. (2025) dataset, they quantify high rates of explanation reuse (mean 3.07 features per annotation, 82.1% of features involved in collisions), with extreme cases like 'plural nouns' covering 101 features. They demonstrate that this leads to annotations resolving only 70% of feature identity on average, prove that standard detection scoring is invariant to collisions, and introduce new metrics to address it. The issue is presented as independent from and additive to polysemanticity.
Significance. If these findings hold, the paper significantly advances the understanding of limitations in automated interpretability for SAEs by highlighting a structural problem in explanation uniqueness. The empirical results from a large, public dataset lend credibility, and the formalization plus proposed metrics (collision-adjusted detection and discrimination scoring) offer actionable improvements. This could prevent overestimation of interpretability by about one-third of the required bits, encouraging better evaluation standards in the field. The use of information theory and mathematical proof for invariance adds rigor.
major comments (2)
- [Information-theoretic analysis] The 70% resolution figure and the 'one-third bits' inflation claim need the precise formula for how annotation entropy relates to feature identity discrimination; cite the specific equation or derivation step in the information-theoretic analysis.
- [Formalization and proof] The proof that detection-style scoring is invariant to collision should be presented with a clear mathematical argument or example in the main text or appendix to allow verification.
minor comments (2)
- [Dataset description] Specify the exact model components (e.g., which layers or modules) involved in the 101 features labeled 'plural nouns' for better context.
- [References] Ensure the citation for Marks et al. 2025 is complete and consistent throughout.
Simulated Author's Rebuttal
We thank the referee for their positive assessment and recommendation of minor revision. The comments highlight opportunities to improve clarity in the information-theoretic and formal sections, which we will address directly.
point-by-point responses
- Referee: [Information-theoretic analysis] The 70% resolution figure and the 'one-third bits' inflation claim need the precise formula for how annotation entropy relates to feature identity discrimination; cite the specific equation or derivation step in the information-theoretic analysis.
  Authors: We agree that an explicit formula strengthens the presentation. The 70% resolution is computed as the average normalized mutual information I(feature identity; annotation) / H(feature identity), equivalently 1 - H(feature identity | annotation) / log2(N) under a uniform prior over the N features. The one-third-bits inflation is the complement (approximately 0.3) of this quantity. In the revision we will insert the exact equation and derivation steps into Section 3.2 of the main text, with the full expansion in the appendix. revision: yes
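Under the uniform-prior reading of this formula, resolution reduces to counting collision-class sizes: an annotation shared by k features leaves log2(k) bits of identity unresolved. A short sketch (function name and toy inputs ours, not from the paper):

```python
import math
from collections import Counter

def resolution(annotations):
    """Fraction of feature-identity bits recovered by the annotations,
    assuming N equiprobable features, one annotation string each:
    I(F; A) / H(F) = 1 - H(F | A) / log2(N)."""
    n = len(annotations)
    counts = Counter(annotations)
    # H(F | A): an annotation class of size k occurs with probability k/n
    # and leaves log2(k) bits of feature identity unresolved.
    h_f_given_a = sum((k / n) * math.log2(k) for k in counts.values())
    return 1 - h_f_given_a / math.log2(n)

# Unique annotations resolve identity fully; a single shared string resolves nothing.
print(resolution(["a", "b", "c", "d"]))  # → 1.0
print(resolution(["a", "a", "a", "a"]))  # → 0.0
```

On the real annotation table this quantity averages to the paper's reported 70%.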
- Referee: [Formalization and proof] The proof that detection-style scoring is invariant to collision should be presented with a clear mathematical argument or example in the main text or appendix to allow verification.
  Authors: We concur that a self-contained argument improves verifiability. Detection scoring depends solely on the match between an explanation and a single feature's activation statistics; it is unchanged when other features receive the identical explanation. We will add a concise proof (by direct substitution into the scoring formula) plus a small numerical example to the main text, with the full derivation in the appendix. revision: yes
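The invariance argument can be made concrete with a toy scorer: any detection-style score that looks only at the (explanation, feature) pair is blind to what other features are called. The scoring function below is a hypothetical stand-in chosen for illustration, not the paper's scorer:

```python
def detection_score(explanation, activating_texts):
    # Toy detection-style scorer: fraction of the feature's top-activating
    # texts containing the explanation string. It sees only one feature's
    # texts at a time, never the other features' explanations.
    return sum(explanation in t for t in activating_texts) / len(activating_texts)

features = {
    "f1": ["cats and dogs", "many cats here"],
    "f2": ["cats sleeping", "cats on mats"],
}

# Fully colliding assignment: both features get the identical string.
colliding = {"f1": "cats", "f2": "cats"}
scores = {f: detection_score(colliding[f], texts) for f, texts in features.items()}

# Detection passes both features perfectly even though the shared string
# cannot tell f1 and f2 apart: the score is invariant to the collision.
print(scores)  # → {'f1': 1.0, 'f2': 1.0}
```

Replacing f2's explanation with any other string leaves f1's score untouched, which is exactly the substitution step the promised proof formalizes.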
Circularity Check
No significant circularity; the results are a direct reanalysis of an independent public dataset using standard information theory.
full rationale
The paper's core claims consist of empirical counts (mean reuse 3.07, 82.1% sharing, 101 features for 'plural nouns') and an information-theoretic calculation (average annotation resolves 70% of identity) performed on the external Marks et al. 2025 dataset. The formalization of 'discrimination' and the proof that detection-style scoring is invariant to collision are mathematical arguments that stand independently of any fitted parameters or self-referential definitions within this work. No equations reduce a prediction to a fitted input by construction, no uniqueness theorems are imported from the authors' prior work, and no ansatz is smuggled via self-citation. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard information-theoretic measure of how much an annotation resolves feature identity (entropy-based discrimination)
Reference graph
Works this paper leans on
- [1] Arditi, A., Obeso, O., Sucholutsky, I., Belrose, N., Coulier Lehalleur, P.-A., Mossing, D., Bhattacharyya, P., Conmy, A., Belinkov, Y., & Nanda, N. (2024). Refusal in language models is mediated by a single direction. arXiv:2406.11717
- [3] Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., & Saunders, W. (2023). Language models can explain neurons in language models. OpenAI Research
- [4] Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., et al. (2023). Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread
- [7] Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2024). Sparse autoencoders find highly interpretable features in language models. In International Conference on Learning Representations
- [8] Gao, L., Dupré la Tour, T., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., & Wu, J. (2024). Scaling and evaluating sparse autoencoders. arXiv:2406.04093
- [9] Goodhart, C. A. E. (1975). Problems of monetary management: The U.K. experience. Papers in Monetary Economics, Reserve Bank of Australia
- [15] Korznikov, A., Galichin, A., Dontsov, A., Rogov, O. Y., Oseledets, I., & Tutubalina, E. (2026). Sanity checks for sparse autoencoders: Do SAEs beat random baselines? arXiv:2602.14111
- [17] Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., & Nanda, N. (2024). Gemma scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv:2408.05147
- [18] Lin, J., & Bloom, J. (2024). Neuronpedia: Interactive reference and tooling for analyzing neural networks. https://www.neuronpedia.org
- [19] Ma, G., et al. (2025). Revising and falsifying sparse autoencoder feature explanations. ICLR Workshop on Mechanistic Interpretability
- [21] Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., & Mueller, A. (2025). Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In International Conference on Learning Representations
- [25] Strathern, M. (1997). ‘Improving Ratings’: Audit in the British University System. European Review, 5(3): 305–321
- [26] Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jermyn, A., et al. (2024). Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread