Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs
Pith reviewed 2026-05-12 01:09 UTC · model grok-4.3
The pith
Questions that leave the model uncertain strengthen the negative correlation (rivalry) between competing sparse autoencoder features at specific layers in LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that feature rivalry acts as a mechanistic signature of uncertainty. In controlled tests, high-entropy questions produce stronger negative correlations between SAE features at layers 0 and 12 compared to low-entropy ones. Intervening by steering along the rivalry axis leads to more output changes than controls, and per-prompt rivalry scores correlate with answer correctness at AUROC 0.689.
What carries the argument
Feature rivalry: the negative correlation between pairs of active SAE features, which the paper proposes as an indicator of processing uncertainty in the residual stream.
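To make the construct concrete, here is a minimal sketch of how rival pairs could be identified, assuming a matrix of SAE feature activations over a prompt set; the thresholds and selection procedure are illustrative assumptions, not taken from the manuscript.

    import numpy as np

    def find_rival_pairs(acts, top_k=20, min_active_frac=0.05):
        # acts: (n_prompts, n_features) SAE feature activations.
        # Thresholds here are illustrative, not the paper's.
        active_frac = (acts > 0).mean(axis=0)
        idx = np.where(active_frac >= min_active_frac)[0]
        sub = acts[:, idx]
        # Pearson correlation between feature activation profiles.
        corr = np.corrcoef(sub.T)
        # The most negative off-diagonal entries are the candidate rivals.
        iu = np.triu_indices_from(corr, k=1)
        order = np.argsort(corr[iu])[:top_k]
        return [(idx[iu[0][o]], idx[iu[1][o]], corr[iu][o]) for o in order]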
If this is right
- Stronger feature rivalry occurs at layers 0 and 12 for high-entropy questions (the entropy split is sketched after this list).
- Steering along rivalry directions (vec_A - vec_B) changes outputs more than random directions for 15 of 20 pairs.
- Per-prompt rivalry scores predict answer correctness with AUROC 0.689, compared to 0.808 for softmax confidence.
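The entropy split is the experiment's first moving part. A minimal sketch, assuming a hypothetical sample_answers(question, n) helper that returns n sampled model answers; the paper's sampling settings and threshold are not specified here.

    from collections import Counter
    import math

    def response_entropy(question, sample_answers, n=16):
        # Shannon entropy over distinct sampled answers. Normalizing
        # answers by strip/lower and the choice of n are assumptions.
        answers = sample_answers(question, n)
        counts = Counter(a.strip().lower() for a in answers)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Split at a threshold tau (e.g., the median) into low/high-entropy sets:
    # high = [q for q in questions if response_entropy(q, sample_answers) > tau]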
Where Pith is reading between the lines
- Rivalry could serve as an internal signal for monitoring uncertainty in deployed models without needing external labels.
- The pattern may appear across other architectures or tasks, providing a general probe for how transformers handle ambiguity.
- Adjusting training to reduce rivalry on uncertain inputs might improve overall calibration.
Load-bearing premise
The negative correlations between SAE features directly reflect the model's uncertainty about the answer rather than resulting from differences in question difficulty or from artifacts in SAE training and feature selection.
What would settle it
Finding no rivalry difference after matching high- and low-entropy questions for difficulty, or finding in a controlled replication that rivalry steering no longer outperforms random directions, would challenge the claim.
Original abstract
Sparse Autoencoders (SAEs) decompose large language model representations into interpretable features, but how these features interact under uncertainty remains poorly understood. We introduce Feature Rivalry -- negatively correlated SAE feature pairs -- and study whether rivalry serves as a mechanistic signature of model uncertainty in Gemma-2-2B using Gemma Scope SAEs. Through a controlled within-domain experiment on PopQA split by response entropy, we find that high-entropy questions produce significantly stronger feature rivalry at layers 0 and 12 relative to low-entropy questions (p=5.3x10^-26 and p=5.8x10^-5 respectively), localizing uncertainty to specific processing stages in the residual stream. We then test whether rivalry is causally upstream of model outputs via activation steering along rivalry axes -- finding that steering along the rivalry direction (vec_A - vec_B) causes more output changes than random directions at low steering multipliers across 15 of 20 rival feature pairs. Finally, a per-prompt rivalry score derived from pairwise cosine similarities of active SAE feature decoder vectors predicts answer correctness (AUROC=0.689), approaching but not matching softmax confidence (AUROC=0.808).
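The abstract does not name the statistical test behind the p-values (a point the referee raises below). A hedged sketch of how the layer-wise comparison could be run on per-prompt rivalry scores, using both a Welch t-test and a nonparametric alternative:

    from scipy import stats

    # rivalry_high, rivalry_low: 1-D arrays of per-prompt rivalry scores
    # for the high- and low-entropy groups at one layer.
    t, p_welch = stats.ttest_ind(rivalry_high, rivalry_low, equal_var=False)
    u, p_mw = stats.mannwhitneyu(rivalry_high, rivalry_low,
                                 alternative="greater")
    # Repeating this across layers would call for a multiple-comparison
    # correction (e.g., Bonferroni), which the abstract does not mention.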
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces 'Feature Rivalry' as negatively correlated pairs of SAE features and examines whether this serves as a mechanistic signature of uncertainty in Gemma-2-2B using Gemma Scope SAEs. On a within-domain PopQA split by response entropy, it reports significantly stronger rivalry at layers 0 and 12 for high-entropy questions (p=5.3e-26 and p=5.8e-5), shows that steering along rivalry axes (vec_A - vec_B) produces more output changes than random directions for 15 of 20 pairs, and finds that a per-prompt rivalry score based on decoder-vector cosine similarities predicts answer correctness (AUROC=0.689).
Significance. If the empirical patterns survive controls for SAE training artifacts and pair-selection effects, the work would add a concrete mechanistic account of uncertainty via feature competition in the residual stream, complementing existing logit- and activation-based uncertainty measures. The interventional steering results and AUROC comparison provide falsifiable tests that could be extended to other models and tasks.
major comments (3)
- [PopQA entropy-split experiment] The rival pairs are identified by scanning negative correlations across the full dataset before the entropy split (see the PopQA experiment description). This selection procedure makes the subsequent high- vs. low-entropy comparison vulnerable to bias: high-entropy prompts may simply activate more overlapping features whose decoder vectors are already anti-aligned by the SAE dictionary, inflating apparent rivalry differences without a direct causal link to uncertainty.
- [Methods / SAE training details] The sparsity penalty in SAE training is known to induce negative correlations between features that compete for the same residual-stream directions, independent of input entropy. The within-domain split controls topic but does not control for the number or overlap of active features; without an ablation that matches the number of active features across entropy regimes or compares against a non-sparse baseline, the claim that rivalry is a signature of uncertainty rather than a training artifact remains under-supported.
- [Activation steering experiments] The steering results state that rivalry-direction interventions cause more output changes than random directions at low multipliers, yet no error bars, exact number of trials per pair, or magnitude-matched controls for the steering vector length are reported. Without these, it is impossible to determine whether the effect is robust or driven by a few outlier pairs.
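For the third comment, a minimal sketch of what a magnitude-matched control could look like, assuming decoder vectors vec_a and vec_b for a rival pair; the steering hook itself is not shown, and the number of controls is an assumption.

    import torch

    def rivalry_and_control_directions(vec_a, vec_b, n_controls=10, seed=0):
        # Rivalry direction (vec_A - vec_B) plus random directions matched
        # to its L2 norm, so any effect difference cannot be attributed to
        # vector magnitude.
        d = vec_a - vec_b
        g = torch.Generator().manual_seed(seed)
        controls = torch.randn(n_controls, d.shape[0], generator=g)
        controls = controls / controls.norm(dim=-1, keepdim=True) * d.norm()
        return d, controls

    # Steering would then add multiplier * direction to the residual stream
    # at the chosen layer via a forward hook (not shown here).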
minor comments (2)
- [Rivalry score definition] The exact formula for the per-prompt rivalry score (pairwise cosine similarities of active decoder vectors) should be stated explicitly, including how ties or zero activations are handled; one candidate formulation is sketched after this list.
- [Results] The statistical test underlying the reported p-values (t-test, Wilcoxon, etc.) and whether multiple-comparison correction was applied should be stated in the main text, not only in the abstract.
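On the first minor comment: the abstract's description ("pairwise cosine similarities of active SAE feature decoder vectors") is consistent with several formulas. One candidate, with the aggregation (mean) and the zero-activation convention marked as assumptions:

    import numpy as np

    def rivalry_score(activations, decoder):
        # activations: (n_features,) SAE activations for one prompt.
        # decoder: (n_features, d_model) decoder weight matrix.
        # Mean aggregation and the <2-active-features convention are
        # assumptions; the paper does not state them.
        active = np.where(activations > 0)[0]
        if len(active) < 2:
            return 0.0
        vecs = decoder[active]
        vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        cos = vecs @ vecs.T
        iu = np.triu_indices(len(active), k=1)
        return -cos[iu].mean()  # more anti-aligned features -> higher score

    # AUROC against correctness (the sign convention is an assumption):
    # from sklearn.metrics import roc_auc_score
    # auroc = roc_auc_score(is_correct, [-rivalry_score(a, W_dec) for a in acts])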
Simulated Author's Rebuttal
Thank you for the detailed review. We appreciate the opportunity to clarify and strengthen our analysis of feature rivalry as a signature of uncertainty in SAE representations. Below we respond to each major comment.
Point-by-point responses
- Referee: [PopQA entropy-split experiment] The rival pairs are identified by scanning negative correlations across the full dataset before the entropy split (see the PopQA experiment description). This selection procedure makes the subsequent high- vs. low-entropy comparison vulnerable to bias: high-entropy prompts may simply activate more overlapping features whose decoder vectors are already anti-aligned by the SAE dictionary, inflating apparent rivalry differences without a direct causal link to uncertainty.
Authors: We thank the referee for highlighting this potential selection bias. The global identification of rival pairs was chosen to capture stable feature competitions that are not specific to any particular prompt set. However, to directly address the concern, we will add a supplementary analysis in which we identify rival pairs independently within the high-entropy and low-entropy subsets and then compare the rivalry metrics. Additionally, we will report the average number of active features in each regime to show that the difference is not solely due to feature count. This revision will strengthen the causal interpretation. revision: partial
- Referee: [Methods / SAE training details] The sparsity penalty in SAE training is known to induce negative correlations between features that compete for the same residual-stream directions, independent of input entropy. The within-domain split controls topic but does not control for the number or overlap of active features; without an ablation that matches the number of active features across entropy regimes or compares against a non-sparse baseline, the claim that rivalry is a signature of uncertainty rather than a training artifact remains under-supported.
Authors: We agree that the sparsity penalty can induce negative correlations as a training artifact. Our within-domain split helps control for topic, but we acknowledge the need for further controls on feature activation counts. In the revision, we will include an ablation where we subsample prompts or features to match the number of active features between high- and low-entropy conditions and re-compute the rivalry differences. We will also discuss why a non-sparse baseline is not directly applicable here, as our focus is on the interpretable SAE features. These additions will better isolate the role of uncertainty. revision: yes
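A minimal sketch of the proposed active-feature-count matching, assuming per-prompt counts of active SAE features in each regime; greedy matching is illustrative, not the authors' procedure.

    import numpy as np

    def match_by_active_count(counts_high, counts_low, tol=1):
        # Greedy one-to-one pairing of high- and low-entropy prompts whose
        # active-feature counts differ by at most tol, so rivalry can be
        # re-compared on count-matched subsets.
        base = np.asarray(counts_low, dtype=float)
        used = np.zeros(len(base), dtype=bool)
        pairs = []
        for i, c in enumerate(counts_high):
            diffs = np.abs(base - c)
            diffs[used] = np.inf
            j = int(np.argmin(diffs))
            if diffs[j] <= tol:
                pairs.append((i, j))
                used[j] = True
        return pairs  # index pairs into the high/low prompt lists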
- Referee: [Activation steering experiments] The steering results state that rivalry-direction interventions cause more output changes than random directions at low multipliers, yet no error bars, exact number of trials per pair, or magnitude-matched controls for the steering vector length are reported. Without these, it is impossible to determine whether the effect is robust or driven by a few outlier pairs.
Authors: We appreciate this point on the reporting of the steering experiments. In the revised manuscript, we will include error bars (standard error across trials), specify the exact number of trials per pair (we used 50 generations per steering vector), and add magnitude-matched controls by normalizing all steering vectors to unit length before applying multipliers. This will confirm that the observed differences are not due to vector magnitude or outliers. revision: yes
Circularity Check
No circularity: purely empirical measurements and interventions with no self-referential derivations
Full rationale
The paper defines feature rivalry directly as negatively correlated SAE feature pairs and measures it via cosine similarities on decoder vectors from held-out prompts. Statistical comparisons (p-values on entropy splits), steering experiments, and AUROC prediction of correctness are all downstream empirical tests on independent data splits. No equations reduce a claimed result to its own inputs by construction, no parameters are fitted then relabeled as predictions, and no load-bearing claims rest on self-citations. The per-prompt rivalry score is computed from observed activations rather than being tautological with the entropy label or uncertainty measure.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: SAE features are sufficiently monosemantic and stable to support mechanistic claims about rivalry
- domain assumption: PopQA entropy split cleanly separates low- and high-uncertainty regimes without confounding difficulty factors
invented entities (1)
- Feature Rivalry: no independent evidence