Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features
Pith reviewed 2026-05-14 20:24 UTC · model grok-4.3
The pith
Many distinct SAE features receive identical natural-language explanations, so existing auto-interpretability scores overstate how uniquely each feature is identified.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Descriptive collision occurs when many distinct SAE features admit the same natural-language explanation. In the Marks et al. dataset of 722 annotated features from Gemma 2 2B and Pythia 70M, the mean annotation is reused across 3.07 features, 82.1 percent of features share their annotation with at least one other, and the most common string (plural nouns) labels 101 features across 18 layers and four model components. The average annotation resolves only 70 percent of feature identity. Current detection scoring is invariant to this reuse, and the collision problem is independent of and additive with polysemanticity.
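The headline collision statistics (mean reuse, fraction of features with a shared annotation, most common label) can be recomputed from any annotation table. A minimal sketch on toy data — the strings and counts here are hypothetical stand-ins, not the Marks et al. dataset:

```python
from collections import Counter

# Toy stand-in for an annotation table: one explanation string per SAE feature.
annotations = [
    "plural nouns", "plural nouns", "plural nouns",
    "German text", "newline tokens", "German text",
    "opening quotation marks",
]

counts = Counter(annotations)

# Mean number of features carrying each feature's annotation string.
mean_reuse = sum(counts[a] for a in annotations) / len(annotations)

# Fraction of features whose annotation labels at least one other feature.
shared_frac = sum(1 for a in annotations if counts[a] > 1) / len(annotations)

# Most common annotation string and how many features it labels.
top_label, top_count = counts.most_common(1)[0]

print(mean_reuse, shared_frac, top_label, top_count)
```

On the real 722-feature table the same three quantities come out to 3.07, 82.1%, and "plural nouns" at 101 features.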
What carries the argument
Descriptive collision, the reuse of one explanation string across multiple distinct SAE features, together with the discrimination metric that quantifies how completely an explanation isolates its target feature from all others.
If this is right
- Current auto-interpretability scores overstate true feature identifiability by an amount equal to roughly one-third of the bits required to name a feature.
- Detection-only metrics remain unchanged even when every explanation is shared by many features, so they cannot detect collision.
- Collision-adjusted detection and discrimination scoring must be used together with existing metrics to avoid additive overestimation.
- The problem appears across layers and model components and is therefore not limited to any single SAE training regime.
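The paper's exact collision-adjusted detection formula is not reproduced in this review; one illustrative adjustment in its spirit splits each feature's detection score across the collision class of its explanation, so credit carried by a reused string is shared rather than double-counted (function and data hypothetical):

```python
from collections import Counter

def collision_adjusted(detection_scores, explanations):
    """Illustrative adjustment (not the paper's exact formula): scale each
    feature's detection score by 1 / (number of features sharing its
    explanation string)."""
    counts = Counter(explanations.values())
    return {f: s / counts[explanations[f]] for f, s in detection_scores.items()}

scores = {"f1": 0.9, "f2": 0.9, "f3": 0.8}
expls = {"f1": "plural nouns", "f2": "plural nouns", "f3": "German text"}

adj = collision_adjusted(scores, expls)
print(adj)  # → {'f1': 0.45, 'f2': 0.45, 'f3': 0.8}
```

Here f1 and f2 split the credit for the shared "plural nouns" string, while the uniquely labeled f3 keeps its full score — the qualitative behavior the corrective metrics are meant to have.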
Where Pith is reading between the lines
- Auto-interpretability pipelines may need to generate longer or more contrastive explanations that explicitly rule out neighboring features.
- If collision persists under automated explanation methods, it would indicate that the underlying SAE features themselves are not cleanly separable by short natural-language descriptions.
- Downstream uses that treat each explanation as a reliable pointer to a single feature, such as circuit editing or safety auditing, inherit the same ambiguity.
Load-bearing premise
That the existing human annotations accurately capture each feature's semantics and that unique discrimination via explanations is required for useful interpretability.
What would settle it
A new annotation pass on the same 722 features that produced unique strings for every feature and raised the recovered identity information above 95 percent would falsify the reported prevalence and severity of collision.
original abstract
Sparse autoencoders (SAEs) are now standard tools for decomposing language model activations into interpretable features, and automated interpretability pipelines routinely assign each feature a short natural-language explanation. Existing critiques of this practice focus on polysemanticity -- one feature with many meanings -- or on whether explanations predict activations. We identify a complementary, structurally distinct problem we call descriptive collision: many distinct SAE features admit the same explanation. Reanalyzing the largest publicly-available dataset of human-annotated SAE features (Marks et al., 2025), comprising 722 annotated features across Gemma 2 2B and Pythia 70M, we find that the mean annotation string is reused across 3.07 features; 82.1% of features share their annotation with at least one other feature; and the single most common annotation string ("plural nouns") labels 101 distinct features spanning 18 layers and four model components. Information-theoretically, the average annotation resolves only 70% of feature identity. We formalize a property called discrimination, prove that current detection-style auto-interpretability scoring is invariant to collision, and propose two complementary corrective metrics -- collision-adjusted detection and discrimination scoring -- that explicitly penalize explanations that fail to distinguish a feature from its neighbors. The collision problem is independent of, and additive with, previously identified failure modes of auto-interpretability; ignoring it inflates reported feature interpretability by a quantity equal to roughly one-third of the bits required to identify a feature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that descriptive collision—where distinct SAE features share the same natural language explanation—is a prevalent and previously underappreciated issue in auto-interpretability. By reanalyzing 722 human-annotated features from the Marks et al. (2025) dataset, they quantify high rates of explanation reuse (mean 3.07 features per annotation, 82.1% of features involved in collisions), with extreme cases like 'plural nouns' covering 101 features. They demonstrate that this leads to annotations resolving only 70% of feature identity on average, prove that standard detection scoring is invariant to collisions, and introduce new metrics to address it. The issue is presented as independent from and additive to polysemanticity.
Significance. If these findings hold, the paper significantly advances the understanding of limitations in automated interpretability for SAEs by highlighting a structural problem in explanation uniqueness. The empirical results from a large, public dataset lend credibility, and the formalization plus proposed metrics (collision-adjusted detection and discrimination scoring) offer actionable improvements. This could prevent overestimation of interpretability by about one-third of the required bits, encouraging better evaluation standards in the field. The use of information theory and mathematical proof for invariance adds rigor.
major comments (2)
- [Information-theoretic analysis] The 70% resolution figure and the 'one-third bits' inflation claim need the precise formula for how annotation entropy relates to feature identity discrimination; cite the specific equation or derivation step in the information-theoretic analysis.
- [Formalization and proof] The proof that detection-style scoring is invariant to collision should be presented with a clear mathematical argument or example in the main text or appendix to allow verification.
minor comments (2)
- [Dataset description] Specify the exact model components (e.g., which layers or modules) involved in the 101 features labeled 'plural nouns' for better context.
- [References] Ensure the citation for Marks et al. 2025 is complete and consistent throughout.
Simulated Author's Rebuttal
We thank the referee for their positive assessment and recommendation of minor revision. The comments highlight opportunities to improve clarity in the information-theoretic and formal sections, which we will address directly.
point-by-point responses
- Referee: [Information-theoretic analysis] The 70% resolution figure and the 'one-third bits' inflation claim need the precise formula for how annotation entropy relates to feature identity discrimination; cite the specific equation or derivation step in the information-theoretic analysis.
  Authors: We agree that an explicit formula strengthens the presentation. The 70% resolution is computed as the average normalized mutual information I(feature identity; annotation) / H(feature identity), equivalently 1 - H(feature identity | annotation) / log2(N) under a uniform prior over the N features. The one-third-bits inflation is the complement (approximately 0.3) of this quantity. In the revision we will insert the exact equation and derivation steps into Section 3.2 of the main text, with the full expansion in the appendix. revision: yes
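Under the uniform-prior reading of this formula, resolution reduces to counting collision-class sizes: an annotation shared by k features leaves log2(k) bits of identity unresolved. A short sketch (function name and toy inputs ours, not from the paper):

```python
import math
from collections import Counter

def resolution(annotations):
    """Fraction of feature-identity bits recovered by the annotations,
    assuming N equiprobable features, one annotation string each:
    I(F; A) / H(F) = 1 - H(F | A) / log2(N)."""
    n = len(annotations)
    counts = Counter(annotations)
    # H(F | A): an annotation class of size k occurs with probability k/n
    # and leaves log2(k) bits of feature identity unresolved.
    h_f_given_a = sum((k / n) * math.log2(k) for k in counts.values())
    return 1 - h_f_given_a / math.log2(n)

# Unique annotations resolve identity fully; a single shared string resolves nothing.
print(resolution(["a", "b", "c", "d"]))  # → 1.0
print(resolution(["a", "a", "a", "a"]))  # → 0.0
```

On the real annotation table this quantity averages to the paper's reported 70%.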
- Referee: [Formalization and proof] The proof that detection-style scoring is invariant to collision should be presented with a clear mathematical argument or example in the main text or appendix to allow verification.
  Authors: We concur that a self-contained argument improves verifiability. Detection scoring depends solely on the match between an explanation and a single feature's activation statistics; it is unchanged when other features receive the identical explanation. We will add a concise proof (by direct substitution into the scoring formula) plus a small numerical example to the main text, with the full derivation in the appendix. revision: yes
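The invariance argument can be made concrete with a toy scorer: any detection-style score that looks only at the (explanation, feature) pair is blind to what other features are called. The scoring function below is a hypothetical stand-in chosen for illustration, not the paper's scorer:

```python
def detection_score(explanation, activating_texts):
    # Toy detection-style scorer: fraction of the feature's top-activating
    # texts containing the explanation string. It sees only one feature's
    # texts at a time, never the other features' explanations.
    return sum(explanation in t for t in activating_texts) / len(activating_texts)

features = {
    "f1": ["cats and dogs", "many cats here"],
    "f2": ["cats sleeping", "cats on mats"],
}

# Fully colliding assignment: both features get the identical string.
colliding = {"f1": "cats", "f2": "cats"}
scores = {f: detection_score(colliding[f], texts) for f, texts in features.items()}

# Detection passes both features perfectly even though the shared string
# cannot tell f1 and f2 apart: the score is invariant to the collision.
print(scores)  # → {'f1': 1.0, 'f2': 1.0}
```

Replacing f2's explanation with any other string leaves f1's score untouched, which is exactly the substitution step the promised proof formalizes.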
Circularity Check
No significant circularity; the results are a direct reanalysis of an independent public dataset using standard information theory.
full rationale
The paper's core claims consist of empirical counts (mean reuse 3.07, 82.1% sharing, 101 features for 'plural nouns') and an information-theoretic calculation (average annotation resolves 70% of identity) performed on the external Marks et al. 2025 dataset. The formalization of 'discrimination' and the proof that detection-style scoring is invariant to collision are mathematical arguments that stand independently of any fitted parameters or self-referential definitions within this work. No equations reduce a prediction to a fitted input by construction, no uniqueness theorems are imported from the authors' prior work, and no ansatz is smuggled via self-citation. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard information-theoretic measure of how much an annotation resolves feature identity (entropy-based discrimination)
Reference graph
Works this paper leans on
- [1] Arditi, A., Obeso, O., Sucholutsky, I., Belrose, N., Coulier Lehalleur, P.-A., Mossing, D., Bhattacharyya, P., Conmy, A., Belinkov, Y., & Nanda, N. (2024). Refusal in language models is mediated by a single direction. arXiv:2406.11717
- [3] Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., & Saunders, W. (2023). Language models can explain neurons in language models. OpenAI Research
- [4] Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., et al. (2023). Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread
- [7] Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2024). Sparse autoencoders find highly interpretable features in language models. In International Conference on Learning Representations
- [8] Gao, L., Dupré la Tour, T., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., & Wu, J. (2024). Scaling and evaluating sparse autoencoders. arXiv:2406.04093
- [9] Goodhart, C. A. E. (1975). Problems of monetary management: The U.K. experience. Papers in Monetary Economics, Reserve Bank of Australia
- [15] Korznikov, A., Galichin, A., Dontsov, A., Rogov, O. Y., Oseledets, I., & Tutubalina, E. (2026). Sanity checks for sparse autoencoders: Do SAEs beat random baselines? arXiv:2602.14111
- [17] Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., & Nanda, N. (2024). Gemma scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv:2408.05147
- [18] Lin, J., & Bloom, J. (2024). Neuronpedia: Interactive reference and tooling for analyzing neural networks. https://www.neuronpedia.org
- [19] Ma, G., et al. (2025). Revising and falsifying sparse autoencoder feature explanations. ICLR Workshop on Mechanistic Interpretability
- [21] Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., & Mueller, A. (2025). Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In International Conference on Learning Representations
- [25] Strathern, M. (1997). ‘Improving Ratings’: Audit in the British University System. European Review, 5(3): 305–321
- [26] Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jermyn, A., et al. (2024). Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread