Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs
Pith reviewed 2026-05-12 01:09 UTC · model grok-4.3
The pith
Questions that leave the model uncertain strengthen the negative correlation (rivalry) between competing sparse autoencoder features at specific layers in LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that feature rivalry acts as a mechanistic signature of uncertainty. In controlled tests, high-entropy questions produce stronger negative correlations between SAE features at layers 0 and 12 compared to low-entropy ones. Intervening by steering along the rivalry axis leads to more output changes than controls, and per-prompt rivalry scores correlate with answer correctness at AUROC 0.689.
What carries the argument
Feature rivalry: the negative correlation between pairs of active SAE features, which the paper proposes as an indicator of processing uncertainty in the residual stream.
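To make the construct concrete, here is a minimal sketch of how rival pairs could be identified, assuming a matrix of SAE feature activations over a prompt set; the thresholds and selection procedure are illustrative assumptions, not taken from the manuscript.

    import numpy as np

    def find_rival_pairs(acts, top_k=20, min_active_frac=0.05):
        # acts: (n_prompts, n_features) SAE feature activations.
        # Thresholds here are illustrative, not the paper's.
        active_frac = (acts > 0).mean(axis=0)
        idx = np.where(active_frac >= min_active_frac)[0]
        sub = acts[:, idx]
        # Pearson correlation between feature activation profiles.
        corr = np.corrcoef(sub.T)
        # The most negative off-diagonal entries are the candidate rivals.
        iu = np.triu_indices_from(corr, k=1)
        order = np.argsort(corr[iu])[:top_k]
        return [(idx[iu[0][o]], idx[iu[1][o]], corr[iu][o]) for o in order]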
If this is right
- Stronger feature rivalry occurs at layers 0 and 12 for high-entropy questions (the entropy split is sketched after this list).
- Steering along rivalry directions (vec_A - vec_B) changes outputs more than random directions for 15 of 20 pairs.
- Per-prompt rivalry scores predict answer correctness with AUROC 0.689, compared to 0.808 for softmax confidence.
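The entropy split is the experiment's first moving part. A minimal sketch, assuming a hypothetical sample_answers(question, n) helper that returns n sampled model answers; the paper's sampling settings and threshold are not specified here.

    from collections import Counter
    import math

    def response_entropy(question, sample_answers, n=16):
        # Shannon entropy over distinct sampled answers. Normalizing
        # answers by strip/lower and the choice of n are assumptions.
        answers = sample_answers(question, n)
        counts = Counter(a.strip().lower() for a in answers)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Split at a threshold tau (e.g., the median) into low/high-entropy sets:
    # high = [q for q in questions if response_entropy(q, sample_answers) > tau]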
Where Pith is reading between the lines
- Rivalry could serve as an internal signal for monitoring uncertainty in deployed models without needing external labels.
- The pattern may appear across other architectures or tasks, providing a general probe for how transformers handle ambiguity.
- Adjusting training to reduce rivalry on uncertain inputs might improve overall calibration.
Load-bearing premise
The negative correlations between SAE features directly reflect the model's uncertainty about the answer rather than resulting from differences in question difficulty or from artifacts in SAE training and feature selection.
What would settle it
Finding no rivalry difference after matching high- and low-entropy questions for difficulty, or finding in a controlled replication that rivalry steering no longer outperforms random directions, would challenge the claim.
Original abstract
Sparse Autoencoders (SAEs) decompose large language model representations into interpretable features, but how these features interact under uncertainty remains poorly understood. We introduce Feature Rivalry -- negatively correlated SAE feature pairs -- and study whether rivalry serves as a mechanistic signature of model uncertainty in Gemma-2-2B using Gemma Scope SAEs. Through a controlled within-domain experiment on PopQA split by response entropy, we find that high-entropy questions produce significantly stronger feature rivalry at layers 0 and 12 relative to low-entropy questions (p=5.3x10^-26 and p=5.8x10^-5 respectively), localizing uncertainty to specific processing stages in the residual stream. We then test whether rivalry is causally upstream of model outputs via activation steering along rivalry axes -- finding that steering along the rivalry direction (vec_A - vec_B) causes more output changes than random directions at low steering multipliers across 15 of 20 rival feature pairs. Finally, a per-prompt rivalry score derived from pairwise cosine similarities of active SAE feature decoder vectors predicts answer correctness (AUROC=0.689), approaching but not matching softmax confidence (AUROC=0.808).
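The abstract does not name the statistical test behind the p-values (a point the referee raises below). A hedged sketch of how the layer-wise comparison could be run on per-prompt rivalry scores, using both a Welch t-test and a nonparametric alternative:

    from scipy import stats

    # rivalry_high, rivalry_low: 1-D arrays of per-prompt rivalry scores
    # for the high- and low-entropy groups at one layer.
    t, p_welch = stats.ttest_ind(rivalry_high, rivalry_low, equal_var=False)
    u, p_mw = stats.mannwhitneyu(rivalry_high, rivalry_low,
                                 alternative="greater")
    # Repeating this across layers would call for a multiple-comparison
    # correction (e.g., Bonferroni), which the abstract does not mention.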
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces 'Feature Rivalry' as negatively correlated pairs of SAE features and examines whether this serves as a mechanistic signature of uncertainty in Gemma-2-2B using Gemma Scope SAEs. On a within-domain PopQA split by response entropy, it reports significantly stronger rivalry at layers 0 and 12 for high-entropy questions (p=5.3e-26 and p=5.8e-5), shows that steering along rivalry axes (vec_A - vec_B) produces more output changes than random directions for 15 of 20 pairs, and finds that a per-prompt rivalry score based on decoder-vector cosine similarities predicts answer correctness (AUROC=0.689).
Significance. If the empirical patterns survive controls for SAE training artifacts and pair-selection effects, the work would add a concrete mechanistic account of uncertainty via feature competition in the residual stream, complementing existing logit- and activation-based uncertainty measures. The interventional steering results and AUROC comparison provide falsifiable tests that could be extended to other models and tasks.
major comments (3)
- [PopQA entropy-split experiment] The rival pairs are identified by scanning negative correlations across the full dataset before the entropy split (see the PopQA experiment description). This selection procedure makes the subsequent high- vs. low-entropy comparison vulnerable to bias: high-entropy prompts may simply activate more overlapping features whose decoder vectors are already anti-aligned by the SAE dictionary, inflating apparent rivalry differences without a direct causal link to uncertainty.
- [Methods / SAE training details] The sparsity penalty in SAE training is known to induce negative correlations between features that compete for the same residual-stream directions, independent of input entropy. The within-domain split controls topic but does not control for the number or overlap of active features; without an ablation that matches the number of active features across entropy regimes or compares against a non-sparse baseline, the claim that rivalry is a signature of uncertainty rather than a training artifact remains under-supported.
- [Activation steering experiments] The steering results state that rivalry-direction interventions cause more output changes than random directions at low multipliers, yet no error bars, exact number of trials per pair, or magnitude-matched controls for the steering vector length are reported. Without these, it is impossible to determine whether the effect is robust or driven by a few outlier pairs.
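For the third comment, a minimal sketch of what a magnitude-matched control could look like, assuming decoder vectors vec_a and vec_b for a rival pair; the steering hook itself is not shown, and the number of controls is an assumption.

    import torch

    def rivalry_and_control_directions(vec_a, vec_b, n_controls=10, seed=0):
        # Rivalry direction (vec_A - vec_B) plus random directions matched
        # to its L2 norm, so any effect difference cannot be attributed to
        # vector magnitude.
        d = vec_a - vec_b
        g = torch.Generator().manual_seed(seed)
        controls = torch.randn(n_controls, d.shape[0], generator=g)
        controls = controls / controls.norm(dim=-1, keepdim=True) * d.norm()
        return d, controls

    # Steering would then add multiplier * direction to the residual stream
    # at the chosen layer via a forward hook (not shown here).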
minor comments (2)
- [Rivalry score definition] The exact formula for the per-prompt rivalry score (pairwise cosine similarities of active decoder vectors) should be stated explicitly, including how ties or zero activations are handled; one candidate formulation is sketched after this list.
- [Results] The statistical test underlying the reported p-values (t-test, Wilcoxon, etc.) and whether multiple-comparison correction was applied should be stated in the main text, not only in the abstract.
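On the first minor comment: the abstract's description ("pairwise cosine similarities of active SAE feature decoder vectors") is consistent with several formulas. One candidate, with the aggregation (mean) and the zero-activation convention marked as assumptions:

    import numpy as np

    def rivalry_score(activations, decoder):
        # activations: (n_features,) SAE activations for one prompt.
        # decoder: (n_features, d_model) decoder weight matrix.
        # Mean aggregation and the <2-active-features convention are
        # assumptions; the paper does not state them.
        active = np.where(activations > 0)[0]
        if len(active) < 2:
            return 0.0
        vecs = decoder[active]
        vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        cos = vecs @ vecs.T
        iu = np.triu_indices(len(active), k=1)
        return -cos[iu].mean()  # more anti-aligned features -> higher score

    # AUROC against correctness (the sign convention is an assumption):
    # from sklearn.metrics import roc_auc_score
    # auroc = roc_auc_score(is_correct, [-rivalry_score(a, W_dec) for a in acts])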
Simulated Author's Rebuttal
Thank you for the detailed review. We appreciate the opportunity to clarify and strengthen our analysis of feature rivalry as a signature of uncertainty in SAE representations. Below we respond to each major comment.
Point-by-point responses
- Referee: [PopQA entropy-split experiment] The rival pairs are identified by scanning negative correlations across the full dataset before the entropy split (see the PopQA experiment description). This selection procedure makes the subsequent high- vs. low-entropy comparison vulnerable to bias: high-entropy prompts may simply activate more overlapping features whose decoder vectors are already anti-aligned by the SAE dictionary, inflating apparent rivalry differences without a direct causal link to uncertainty.
Authors: We thank the referee for highlighting this potential selection bias. The global identification of rival pairs was chosen to capture stable feature competitions that are not specific to any particular prompt set. However, to directly address the concern, we will add a supplementary analysis in which we identify rival pairs independently within the high-entropy and low-entropy subsets and then compare the rivalry metrics. Additionally, we will report the average number of active features in each regime to show that the difference is not solely due to feature count. This revision will strengthen the causal interpretation. revision: partial
- Referee: [Methods / SAE training details] The sparsity penalty in SAE training is known to induce negative correlations between features that compete for the same residual-stream directions, independent of input entropy. The within-domain split controls topic but does not control for the number or overlap of active features; without an ablation that matches the number of active features across entropy regimes or compares against a non-sparse baseline, the claim that rivalry is a signature of uncertainty rather than a training artifact remains under-supported.
Authors: We agree that the sparsity penalty can induce negative correlations as a training artifact. Our within-domain split helps control for topic, but we acknowledge the need for further controls on feature activation counts. In the revision, we will include an ablation where we subsample prompts or features to match the number of active features between high- and low-entropy conditions and re-compute the rivalry differences. We will also discuss why a non-sparse baseline is not directly applicable here, as our focus is on the interpretable SAE features. These additions will better isolate the role of uncertainty. revision: yes
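A minimal sketch of the proposed active-feature-count matching, assuming per-prompt counts of active SAE features in each regime; greedy matching is illustrative, not the authors' procedure.

    import numpy as np

    def match_by_active_count(counts_high, counts_low, tol=1):
        # Greedy one-to-one pairing of high- and low-entropy prompts whose
        # active-feature counts differ by at most tol, so rivalry can be
        # re-compared on count-matched subsets.
        base = np.asarray(counts_low, dtype=float)
        used = np.zeros(len(base), dtype=bool)
        pairs = []
        for i, c in enumerate(counts_high):
            diffs = np.abs(base - c)
            diffs[used] = np.inf
            j = int(np.argmin(diffs))
            if diffs[j] <= tol:
                pairs.append((i, j))
                used[j] = True
        return pairs  # index pairs into the high/low prompt lists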
- Referee: [Activation steering experiments] The steering results state that rivalry-direction interventions cause more output changes than random directions at low multipliers, yet no error bars, exact number of trials per pair, or magnitude-matched controls for the steering vector length are reported. Without these, it is impossible to determine whether the effect is robust or driven by a few outlier pairs.
Authors: We appreciate this point on the reporting of the steering experiments. In the revised manuscript, we will include error bars (standard error across trials), specify the exact number of trials per pair (we used 50 generations per steering vector), and add magnitude-matched controls by normalizing all steering vectors to unit length before applying multipliers. This will confirm that the observed differences are not due to vector magnitude or outliers. revision: yes
Circularity Check
No circularity: purely empirical measurements and interventions with no self-referential derivations
Full rationale
The paper defines feature rivalry directly as negatively correlated SAE feature pairs and measures it via cosine similarities on decoder vectors from held-out prompts. Statistical comparisons (p-values on entropy splits), steering experiments, and AUROC prediction of correctness are all downstream empirical tests on independent data splits. No equations reduce a claimed result to its own inputs by construction, no parameters are fitted then relabeled as predictions, and no load-bearing claims rest on self-citations. The per-prompt rivalry score is computed from observed activations rather than being tautological with the entropy label or uncertainty measure.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: SAE features are sufficiently monosemantic and stable to support mechanistic claims about rivalry
- domain assumption: PopQA entropy split cleanly separates low- and high-uncertainty regimes without confounding difficulty factors
invented entities (1)
- Feature Rivalry: no independent evidence