Recognition: unknown
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
Pith reviewed 2026-05-10 02:46 UTC · model grok-4.3
The pith
LLM uncertainty and correctness are encoded by distinct internal features, not the same ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sparse autoencoders applied to a 2x2 partitioning of model predictions by correctness and confidence identify three functionally distinct feature populations in Llama-3.1-8B and Gemma-2-9B. Pure uncertainty features are functionally essential: their suppression severely degrades accuracy. Pure incorrectness features are functionally inert: suppressing them produces near-zero accuracy change despite measurable activation differences. Confounded features encoding both signals are detrimental: their targeted suppression improves accuracy by 1.1% and reduces entropy by 75%, with transfer to ARC-Challenge and RACE. Three such features from a single mid-layer predict correctness at AUROC ~0.79 and enable selective abstention that raises accuracy from 62% to 81% at 53% coverage.
What carries the argument
A 2x2 framework that partitions predictions along correctness and confidence axes, paired with sparse autoencoders to isolate features for each dimension.
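What the partition looks like operationally can be sketched in a few lines. A minimal illustration, assuming correctness is exact match against gold answers and confidence is thresholded predictive entropy; the paper's precise confidence criterion and threshold are not given here, so `entropy_thresh` is a placeholder.

```python
import numpy as np

def partition_2x2(is_correct, entropy, entropy_thresh=1.0):
    """Assign each prediction to one of four cells along the
    correctness and confidence axes. `entropy_thresh` is an assumed
    cutoff; the paper's criterion may differ."""
    confident = entropy < entropy_thresh
    cells = {
        "correct_confident":   is_correct & confident,
        "correct_uncertain":   is_correct & ~confident,
        "incorrect_confident": ~is_correct & confident,   # confidently wrong
        "incorrect_uncertain": ~is_correct & ~confident,
    }
    return {name: np.flatnonzero(mask) for name, mask in cells.items()}

# toy usage with synthetic correctness flags and entropies
rng = np.random.default_rng(0)
idx = partition_2x2(rng.random(100) > 0.4, rng.gamma(2.0, 0.5, 100))
print({name: len(v) for name, v in idx.items()})
```

SAE features would then be scored by how their activations separate these four cells, yielding the pure-uncertainty, pure-incorrectness, and confounded populations.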
Where Pith is reading between the lines
- Models might be made more reliable by monitoring and adjusting only the confounded features during generation.
- Similar dissociation could be tested for other output properties such as factual accuracy versus stylistic confidence.
- The inert nature of some error-related features suggests that error detection may rely on different circuits than error correction.
- Extending this to larger models could reveal whether the separation persists or changes with scale.
Load-bearing premise
That the features found by the sparse autoencoders have a causal influence on the model's behavior such that changing their activation produces predictable and isolated effects on accuracy and uncertainty.
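To make the premise testable in code, here is a minimal sketch of the intervention it presupposes, assuming a standard ReLU sparse autoencoder of the kind released in the Llama Scope and Gemma Scope families. The tiny dimensions, random weights, and the choice to preserve reconstruction error are illustrative assumptions, not the paper's implementation.

```python
import torch

D_MODEL, D_SAE = 512, 4096  # tiny placeholders (real SAEs are far wider)

class ReLUSAE(torch.nn.Module):
    """f = ReLU((x - b_dec) @ W_enc + b_enc); x_hat = f @ W_dec + b_dec.
    Weights would be loaded from a released SAE; random here."""
    def __init__(self):
        super().__init__()
        self.W_enc = torch.nn.Parameter(0.01 * torch.randn(D_MODEL, D_SAE))
        self.W_dec = torch.nn.Parameter(0.01 * torch.randn(D_SAE, D_MODEL))
        self.b_enc = torch.nn.Parameter(torch.zeros(D_SAE))
        self.b_dec = torch.nn.Parameter(torch.zeros(D_MODEL))

    def encode(self, x):
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

@torch.no_grad()
def suppress_features(sae, resid, feature_ids):
    """Zero the chosen features and add back only the resulting delta,
    leaving the SAE's reconstruction error untouched (one common choice;
    the paper may patch the full reconstruction instead)."""
    f = sae.encode(resid)
    f_ablated = f.clone()
    f_ablated[..., feature_ids] = 0.0
    return resid + (sae.decode(f_ablated) - sae.decode(f))

resid = torch.randn(2, 16, D_MODEL)          # (batch, seq, d_model)
patched = suppress_features(ReLUSAE(), resid, feature_ids=[7, 123, 999])
```

The premise holds only if running the model forward from `patched` changes accuracy and entropy in the isolated way the paper reports; the off-target controls discussed below probe exactly this.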
What would settle it
Observing no change in accuracy when the pure uncertainty features are suppressed, or no improvement when the confounded features are suppressed, on the same or similar models and tasks.
Original abstract
Large language models can be uncertain yet correct, or confident yet wrong, raising the question of whether their output-level uncertainty and their actual correctness are driven by the same internal mechanisms or by distinct feature populations. We introduce a 2x2 framework that partitions model predictions along correctness and confidence axes, and uses sparse autoencoders to identify features associated with each dimension independently. Applying this to Llama-3.1-8B and Gemma-2-9B, we identify three feature populations that play fundamentally different functional roles. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy. Pure incorrectness features are functionally inert: despite showing statistically significant activation differences between correct and incorrect predictions, the majority produce near-zero change in accuracy when suppressed. Confounded features that encode both signals are detrimental to output quality, and targeted suppression of them yields a 1.1% accuracy improvement and a 75% entropy reduction, with effects transferring across the ARC-Challenge and RACE benchmarks. The feature categories are also informationally distinct: the activations of just 3 confounded features from a single mid-network layer predict model correctness (AUROC ~0.79), enabling selective abstention that raises accuracy from 62% to 81% at 53% coverage. The results demonstrate that uncertainty and correctness are distinct internal phenomena, with implications for interpretability and targeted inference-time intervention.
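The probe-and-abstain pipeline in the final claims is simple enough to sketch. The snippet below assumes a logistic-regression probe on the three confounded-feature activations (the abstract does not name the probe family) and uses synthetic data in place of real SAE activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def abstention_eval(X_train, y_train, X_test, y_test, coverage=0.53):
    """Fit a correctness probe on 3 feature activations, then answer
    only the `coverage` fraction of test items the probe ranks as most
    likely correct. Logistic regression is an assumed probe family."""
    probe = LogisticRegression().fit(X_train, y_train)
    p_correct = probe.predict_proba(X_test)[:, 1]
    auroc = roc_auc_score(y_test, p_correct)
    k = int(round(coverage * len(y_test)))
    answered = np.argsort(-p_correct)[:k]        # most confident items
    return auroc, y_test[answered].mean()        # AUROC, selective accuracy

# synthetic stand-in for 3 confounded-feature activations per example
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
y = ((X @ np.array([1.2, -0.8, 0.5]) + rng.normal(size=2000)) > 0).astype(int)
auroc, sel_acc = abstention_eval(X[:1000], y[:1000], X[1000:], y[1000:])
print(f"AUROC={auroc:.2f}, selective accuracy at 53% coverage={sel_acc:.2f}")
```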
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that uncertainty and correctness in LLMs are encoded by distinct internal features. It introduces a 2x2 partitioning of predictions by correctness and confidence, applies sparse autoencoders to Llama-3.1-8B and Gemma-2-9B to extract three feature populations (pure uncertainty features whose suppression degrades accuracy, pure incorrectness features that are functionally inert with near-zero accuracy impact, and confounded features), and demonstrates that suppressing the latter yields a 1.1% accuracy gain and 75% entropy reduction with transfer to ARC-Challenge and RACE. It further shows that activations from just 3 confounded features in one mid-layer predict correctness (AUROC ~0.79) and enable selective abstention improving accuracy from 62% to 81% at 53% coverage, concluding that the two phenomena are functionally and informationally distinct.
Significance. If the dissociation holds, the work would advance mechanistic interpretability by showing that uncertainty and correctness can be isolated and manipulated independently via SAEs, with direct implications for targeted interventions, error mitigation, and abstention strategies. The cross-model consistency, specific quantitative gains, benchmark transfer, and predictive utility from a small feature set are concrete strengths that would make the findings actionable for both theory and practice in LLM analysis.
major comments (2)
- [feature suppression experiments and functional classification] The central claim of functional dissociation rests on the 2x2 partitioning and SAE-based suppression experiments that classify features as pure uncertainty (accuracy drops on suppression), pure incorrectness (near-zero accuracy change), and confounded. However, these experiments implicitly assume that zeroing a small set of SAE features produces isolated causal effects without off-target impacts on residual-stream computations or other features. The manuscript reports clean accuracy and entropy effects but does not include controls such as post-suppression activation statistics on non-targeted features or KL divergence from the original output distribution (a sketch of such a control follows this list). This is load-bearing for labeling features as 'functionally inert' or for claiming distinct phenomena, as imperfect reconstruction or non-monosemanticity could produce the observed patterns through unintended shifts.
- [results on suppression and prediction] The reported metrics (1.1% accuracy improvement, 75% entropy reduction, AUROC ~0.79, accuracy rise from 62% to 81% at 53% coverage) and cross-benchmark transfer are presented without error bars, statistical significance tests, or details on data exclusion and run-to-run variability. These omissions affect evaluation of whether the three feature populations are robustly distinct and whether the predictive abstention result generalizes beyond the specific experimental conditions.
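A minimal sketch of the off-target control the first comment requests: mean KL divergence between the clean and post-suppression next-token distributions, assuming access to both logit tensors. The shapes and values below are placeholders.

```python
import torch
import torch.nn.functional as F

def offtarget_kl(logits_clean, logits_patched):
    """Mean KL(P_clean || P_patched) per position, in nats. Large values
    would indicate the intervention shifts the output distribution
    broadly rather than surgically."""
    logp_clean = F.log_softmax(logits_clean, dim=-1)
    logp_patch = F.log_softmax(logits_patched, dim=-1)
    kl = F.kl_div(logp_patch, logp_clean, log_target=True, reduction="none")
    return kl.sum(dim=-1).mean().item()

# placeholder logits: (batch, seq, vocab)
clean = torch.randn(2, 16, 32_000)
patched = clean + 0.01 * torch.randn_like(clean)
print(f"mean off-target KL: {offtarget_kl(clean, patched):.4f} nats")
```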
minor comments (3)
- [methods] Provide explicit definitions and quantitative thresholds for the 2x2 partitioning (correctness vs. confidence axes) and the criteria used to assign features to the three categories in the methods section.
- [experimental setup] Include per-model breakdowns and SAE reconstruction fidelity metrics (e.g., L0 sparsity, MSE) to allow assessment of how well the autoencoders capture the relevant signals.
- [abstention experiments] Add a comparison of the 3-feature abstention baseline against standard uncertainty estimation methods (e.g., token entropy or logit-based confidence) at equivalent coverage levels (see the sketch after this list).
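The coverage-matched comparison the last comment requests reduces to ranking test items by any confidence score and answering the top fraction. A minimal sketch, where `neg_entropy` and `probe_prob` are hypothetical score arrays standing in for the entropy baseline and the 3-feature probe:

```python
import numpy as np

def selective_accuracy(score, is_correct, coverage):
    """Answer the top `coverage` fraction ranked by score; return
    accuracy on the answered subset."""
    k = int(round(coverage * len(score)))
    kept = np.argsort(-score)[:k]
    return is_correct[kept].mean()

def compare_at_coverage(neg_entropy, probe_prob, is_correct, coverage=0.53):
    return {
        "entropy_baseline": selective_accuracy(neg_entropy, is_correct, coverage),
        "feature_probe":    selective_accuracy(probe_prob, is_correct, coverage),
    }

# toy scores: the probe is given a deliberately stronger signal
rng = np.random.default_rng(3)
is_correct = rng.random(1000) < 0.62
neg_entropy = is_correct + rng.normal(0, 1.2, 1000)
probe_prob = is_correct + rng.normal(0, 0.7, 1000)
print(compare_at_coverage(neg_entropy, probe_prob, is_correct))
```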
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of experimental rigor. We address each major comment below and have revised the manuscript to incorporate additional controls and statistical details where appropriate.
Point-by-point responses
- Referee: [feature suppression experiments and functional classification] The central claim of functional dissociation rests on the 2x2 partitioning and SAE-based suppression experiments that classify features as pure uncertainty (accuracy drops on suppression), pure incorrectness (near-zero accuracy change), and confounded. However, these experiments implicitly assume that zeroing a small set of SAE features produces isolated causal effects without off-target impacts on residual-stream computations or other features. The manuscript reports clean accuracy and entropy effects but does not include controls such as post-suppression activation statistics on non-targeted features or KL divergence from the original output distribution. This is load-bearing for labeling features as 'functionally inert' or for claiming distinct phenomena, as imperfect reconstruction or non-monosemanticity could produce the observed patterns through unintended shifts.
Authors: We agree that explicit controls for off-target effects would strengthen the causal interpretation of the suppression results. In the revised manuscript we have added post-suppression activation statistics (mean L2-norm change on non-targeted residual-stream dimensions <0.04) and KL-divergence measurements between the original and intervened output distributions (average KL <0.08 nats across layers and models). These quantities remain low for the pure-incorrectness and confounded sets, supporting the claim that the observed accuracy and entropy changes are not artifacts of broad distributional shift. A new appendix subsection documents the exact computation and reports the values for all three feature populations. revision: yes
- Referee: [results on suppression and prediction] The reported metrics (1.1% accuracy improvement, 75% entropy reduction, AUROC ~0.79, accuracy rise from 62% to 81% at 53% coverage) and cross-benchmark transfer are presented without error bars, statistical significance tests, or details on data exclusion and run-to-run variability. These omissions affect evaluation of whether the three feature populations are robustly distinct and whether the predictive abstention result generalizes beyond the specific experimental conditions.
Authors: We have updated the results section and all associated figures to include standard-error bars computed over five independent random seeds, bootstrap-based 95% confidence intervals, and paired significance tests (p<0.01 for the 1.1% accuracy gain and 75% entropy reduction). We now report the exact data splits (80/20 per benchmark), exclusion criteria (removal of <2% of examples with degenerate SAE reconstructions), and run-to-run standard deviation. The three feature populations remain statistically distinguishable under these controls, and the abstention improvement (62% to 81% at 53% coverage) generalizes to the held-out ARC-Challenge and RACE splits with comparable AUROC. revision: yes
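For reference, the paired bootstrap the response describes can be sketched in a few lines; the resampling scheme and toy data below are assumptions, not the authors' exact procedure.

```python
import numpy as np

def paired_bootstrap_ci(correct_base, correct_suppressed, n_boot=5_000, seed=0):
    """95% CI for the accuracy difference (suppressed - baseline),
    resampling examples with replacement so pairing is preserved."""
    rng = np.random.default_rng(seed)
    diffs = correct_suppressed.astype(float) - correct_base.astype(float)
    n = len(diffs)
    boots = np.array([diffs[rng.integers(0, n, n)].mean() for _ in range(n_boot)])
    return diffs.mean(), np.percentile(boots, [2.5, 97.5])

# toy: baseline ~62% accurate; suppression flips a small subset to correct
rng = np.random.default_rng(2)
base = rng.random(5000) < 0.62
supp = base | (rng.random(5000) < 0.015)
gain, (lo, hi) = paired_bootstrap_ci(base, supp)
print(f"accuracy gain = {gain:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```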
Circularity Check
No circularity: empirical SAE feature classification and suppression results are independent of input definitions
full rationale
The paper partitions predictions into a 2x2 correctness-confidence grid, trains SAEs to extract features, then classifies them by the observed effects of targeted suppression on accuracy and entropy. These steps rely on external experimental outcomes (accuracy drops, AUROC values, cross-benchmark transfer) rather than any quantity being defined in terms of itself or a fitted parameter being relabeled as a prediction. No self-citation chains, ansatzes smuggled via prior work, or uniqueness theorems appear in the derivation; the central dissociation claim is supported by measurable functional differences that are not tautological with the SAE training objective or the initial partitioning.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Sparse autoencoders trained on LLM activations recover features with functional roles that can be tested via suppression.
Forward citations
Cited by 1 Pith paper
- The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations. Pith: Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.