Confidence Geometry Reveals Trace-Level Correctness in Large Language Model Reasoning
Pith reviewed 2026-05-19 20:50 UTC · model grok-4.3
The pith
Token-level confidence trajectories in LLMs form low-dimensional geometries that separate correct from incorrect reasoning traces without using question or text content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large language models generate reasoning text along with token-level confidence trajectories that record how uncertainty evolves during inference. These trajectories possess a content-agnostic confidence geometry linked to the correctness of the final answer. Low-dimensional representations of the trajectories separate correct from incorrect traces across the GSM8K, MATH, and MMLU benchmarks. Separation strength, measured by the Davies-Bouldin index, tracks correctness-discrimination AUC, and correctness signals concentrate in the tail of the reasoning process. A lightweight estimator, NeuralConf, learns from these trajectories to improve answer selection.
What carries the argument
The confidence geometry formed by low-dimensional representations of token-level confidence trajectories, which separates correct from incorrect reasoning traces.
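To make the notion of geometric separation concrete, the sketch below computes the Davies-Bouldin index for two clusters of trajectory summaries. It is an illustration under stated assumptions, not the paper's pipeline: the two-dimensional summary (overall mean and tail mean) is a hypothetical stand-in for the learned low-dimensional representation, and the trajectories are made up.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def scatter(points, center):
    """Mean Euclidean distance from each point to the cluster centroid."""
    return sum(math.dist(p, center) for p in points) / len(points)

def davies_bouldin_two_clusters(cluster_a, cluster_b):
    """Davies-Bouldin index for exactly two clusters: (S_a + S_b) / d(c_a, c_b)."""
    c_a, c_b = centroid(cluster_a), centroid(cluster_b)
    return (scatter(cluster_a, c_a) + scatter(cluster_b, c_b)) / math.dist(c_a, c_b)

def summarize(trajectory, tail=3):
    """Hypothetical 2-D summary: (overall mean, mean of the last `tail` values)."""
    tail_values = trajectory[-tail:]
    return [sum(trajectory) / len(trajectory), sum(tail_values) / len(tail_values)]

# Made-up confidence trajectories, for illustration only.
correct = [[0.6, 0.7, 0.8, 0.9, 0.9], [0.5, 0.7, 0.9, 0.9, 1.0]]
incorrect = [[0.6, 0.5, 0.4, 0.3, 0.3], [0.7, 0.5, 0.4, 0.2, 0.3]]

db = davies_bouldin_two_clusters(
    [summarize(t) for t in correct],
    [summarize(t) for t in incorrect],
)
print(f"Davies-Bouldin index: {db:.3f}")  # lower means tighter, better-separated clusters
```

A lower index means the two groups are compact relative to the distance between their centroids, which is the sense in which stronger clustering is reported to accompany higher AUC.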
If this is right
- Stronger clustering of correct and incorrect traces by Davies-Bouldin index corresponds to higher correctness-discrimination AUC.
- Correctness-related information is enriched in the tail of the reasoning trace.
- NeuralConf-derived scores improve confidence-weighted answer aggregation over majority voting and other baselines under a fixed trace budget.
- LLMs expose trace-intrinsic statistical signals of correctness through their own confidence dynamics.
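The aggregation claim above can be illustrated with a minimal sketch that contrasts majority voting with score-weighted voting over a fixed set of sampled traces. The answers and scores are invented, and the scores merely stand in for whatever a NeuralConf-style estimator would output; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Return the most frequent answer (ties go to the answer seen first)."""
    return Counter(answers).most_common(1)[0][0]

def weighted_vote(answers, scores):
    """Return the answer whose traces have the largest summed correctness score."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)

# Five sampled traces for one question (made-up values).
answers = ["41", "41", "41", "42", "42"]
scores = [0.2, 0.3, 0.2, 0.9, 0.8]  # hypothetical per-trace correctness scores

print(majority_vote(answers))          # 41
print(weighted_vote(answers, scores))  # 42
```

The two rules disagree whenever a minority answer is backed by traces with much higher scores, which is the only situation in which a learned score can change the aggregate outcome.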
Where Pith is reading between the lines
- This approach could allow verification of reasoning quality using only the model's own output probabilities.
- The geometric separation might generalize to other types of sequential predictions in AI systems.
- Monitoring confidence trajectories could help detect errors in long-form generation tasks.
Load-bearing premise
The observed low-dimensional separation in confidence trajectories is caused by trace-level correctness rather than by other properties such as trace length or token distributions.
What would settle it
If the separation between correct and incorrect traces disappears when traces are matched for length and similar token statistics, or if it does not hold on additional benchmarks outside GSM8K, MATH, and MMLU, the central claim would be falsified.
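One way to run the length-matched check described above is sketched here. It assumes each trace is a dictionary with hypothetical `length`, `correct`, and `score` fields, bins traces by length, and keeps equal numbers of correct and incorrect traces per bin before recomputing AUC. The bin width and data are illustrative choices, not values from the paper.

```python
import random
from collections import defaultdict

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def length_matched(traces, bin_width=50, seed=0):
    """Keep equal numbers of correct and incorrect traces within each length bin."""
    rng = random.Random(seed)
    bins = defaultdict(lambda: {True: [], False: []})
    for trace in traces:
        bins[trace["length"] // bin_width][trace["correct"]].append(trace)
    matched = []
    for groups in bins.values():
        k = min(len(groups[True]), len(groups[False]))
        matched += rng.sample(groups[True], k) + rng.sample(groups[False], k)
    return matched

def trace_auc(traces):
    """AUC of the `score` field for separating correct from incorrect traces."""
    pos = [t["score"] for t in traces if t["correct"]]
    neg = [t["score"] for t in traces if not t["correct"]]
    return auc(pos, neg)

# Made-up traces: the field names and values are assumptions for illustration.
traces = [
    {"length": 100, "correct": True, "score": 0.9},
    {"length": 120, "correct": True, "score": 0.8},
    {"length": 110, "correct": False, "score": 0.4},
    {"length": 300, "correct": False, "score": 0.3},
    {"length": 320, "correct": False, "score": 0.2},
    {"length": 310, "correct": True, "score": 0.7},
]

matched = length_matched(traces)
auc_all = trace_auc(traces)
auc_matched = trace_auc(matched)
print(len(matched), auc_all, auc_matched)
```

If the matched AUC falls toward 0.5 while the unmatched AUC stays high, the signal was largely carried by length; if it holds, length alone does not explain the separation.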
Original abstract
Large language models (LLMs) generate not only reasoning text, but also token-level confidence trajectories that record how uncertainty evolves during inference. Whether these trajectories are relevant to reasoning correctness remains unclear. Here we show that confidence trajectories encode a content-agnostic confidence geometry associated with trace-level final-answer correctness. Using only token-level confidence values, without access to the input question, reasoning text, hidden states, or external verifiers, we find that low-dimensional representations of confidence trajectories separate correct from incorrect reasoning traces. Across GSM8K, MATH, and MMLU, this geometric separation is quantitatively linked to downstream predictability: stronger clustering of correct and incorrect traces, measured by the Davies-Bouldin index, consistently corresponds to higher correctness-discrimination AUC. We further show that correctness-related information is enriched in the tail of reasoning, suggesting that late-stage confidence dynamics carry key correctness signals. We propose NeuralConf, a lightweight estimator that learns from confidence trajectories for correctness evaluation. Under a fixed trace budget, NeuralConf-derived scores improve confidence-weighted answer aggregation over majority voting, tail confidence, and other static baselines. These results reveal that LLMs expose trace-intrinsic statistical signals of correctness through their own confidence dynamics, offering a route to improve inference using information already present within generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that token-level confidence trajectories generated by LLMs during reasoning encode a content-agnostic low-dimensional 'confidence geometry' that separates correct from incorrect final-answer traces. Across GSM8K, MATH, and MMLU, the Davies-Bouldin index of this clustering correlates with AUC for correctness discrimination; correctness signals are enriched in the tail of trajectories; and NeuralConf, a lightweight estimator trained on these trajectories, improves confidence-weighted answer aggregation over majority voting and static baselines under a fixed trace budget, using only the scalar confidence sequence without access to questions, text, or hidden states.
Significance. If the separation survives controls for generation artifacts, the result would be significant: it identifies an intrinsic, content-free statistical signal of reasoning correctness already present in standard LLM generation, enabling new inference-time methods that improve upon existing aggregation strategies without external verifiers. The reported correlation between clustering quality and downstream AUC, together with the tail-enrichment observation, would constitute a falsifiable empirical link between trajectory geometry and correctness.
major comments (3)
- [Abstract] The reported quantitative link between Davies-Bouldin index and AUC for correctness discrimination is presented without any description of the embedding procedure that maps raw confidence sequences to low-dimensional representations, the criterion used to select dimensionality, or statistical controls (e.g., length regression or matched subsampling).
- [Abstract] The central claim that observed separation reflects trace-level correctness rather than correlated generation properties is load-bearing yet unsupported by the described experiments; incorrect traces systematically differ in length, cumulative entropy, and stopping dynamics, any of which can produce separable patterns in a raw scalar sequence, and no length-matched ablation, token-distribution control, or regression on length is mentioned.
- [Abstract] The claim that NeuralConf improves aggregation 'under a fixed trace budget' is stated without specifying the training objective, input representation, or whether the model receives only the confidence trajectory (as asserted) versus additional features; this leaves unclear whether the performance gain is attributable to the proposed geometry or to other factors.
minor comments (2)
- [Abstract] The abstract introduces 'confidence geometry' and 'NeuralConf' without a concise formal definition or pointer to the section that defines them.
- The manuscript should clarify the exact set of LLMs and generation hyperparameters used to produce the confidence trajectories, as these choices affect the generality of the reported separation.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for recognizing the potential significance of identifying content-agnostic signals in LLM confidence trajectories. We address each major comment below with clarifications drawn from the manuscript and indicate where revisions will strengthen the presentation.
- Referee: [Abstract] The reported quantitative link between Davies-Bouldin index and AUC for correctness discrimination is presented without any description of the embedding procedure that maps raw confidence sequences to low-dimensional representations, the criterion used to select dimensionality, or statistical controls (e.g., length regression or matched subsampling).
  Authors: We agree that the abstract's brevity omitted these details. The full manuscript (Section 3.2) specifies that raw confidence sequences are embedded via PCA after length normalization and padding, with dimensionality selected by retaining components that explain at least 90% of variance; statistical controls include both length regression and matched subsampling by trace length. In the revision we will add a concise clause to the abstract summarizing the embedding and controls so that the quantitative link is presented with the necessary methodological context.
  Revision: yes
- Referee: [Abstract] The central claim that observed separation reflects trace-level correctness rather than correlated generation properties is load-bearing yet unsupported by the described experiments; incorrect traces systematically differ in length, cumulative entropy, and stopping dynamics, any of which can produce separable patterns in a raw scalar sequence, and no length-matched ablation, token-distribution control, or regression on length is mentioned.
  Authors: This concern is well-founded and we acknowledge that generation artifacts must be ruled out. The manuscript already reports length-matched subsampling and linear regression of Davies-Bouldin index on length and entropy (Section 4.1 and Appendix B), showing that geometric separation persists after these controls. However, these controls were not referenced in the abstract. We will revise the abstract to explicitly state that the reported correlation survives length-matched ablation and regression on length/entropy, thereby directly addressing the possibility that separation arises from generation properties rather than correctness.
  Revision: yes
- Referee: [Abstract] The claim that NeuralConf improves aggregation 'under a fixed trace budget' is stated without specifying the training objective, input representation, or whether the model receives only the confidence trajectory (as asserted) versus additional features; this leaves unclear whether the performance gain is attributable to the proposed geometry or to other factors.
  Authors: We accept that the abstract does not fully specify these aspects. NeuralConf is a small MLP trained with binary cross-entropy on correctness labels using only the scalar confidence sequence (padded to fixed length) as input; no text, hidden states, or question features are provided. The fixed-trace-budget experiments compare against majority voting and static baselines under identical generation constraints. In revision we will insert a brief parenthetical in the abstract clarifying that NeuralConf receives solely the confidence trajectory and is trained for binary correctness prediction, making the source of the reported gains explicit.
  Revision: yes
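The preprocessing mentioned in the rebuttal (length normalization, then keeping PCA components that explain at least 90% of variance) can be sketched as follows. This is an assumption-laden illustration: linear interpolation is only one possible reading of "length normalization", the function names are hypothetical, and the variance values are made up rather than taken from the manuscript.

```python
def resample(trajectory, target_len):
    """Linearly interpolate a confidence trajectory onto `target_len` evenly spaced points."""
    if target_len < 2:
        raise ValueError("target_len must be at least 2")
    if len(trajectory) == 1:
        return [trajectory[0]] * target_len
    step = (len(trajectory) - 1) / (target_len - 1)
    resampled = []
    for i in range(target_len):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(trajectory) - 1)
        frac = pos - lo
        resampled.append(trajectory[lo] * (1 - frac) + trajectory[hi] * frac)
    return resampled

def components_for_variance(explained_variance, threshold=0.90):
    """Smallest number of leading components whose cumulative variance share reaches `threshold`."""
    total = sum(explained_variance)
    running = 0.0
    for k, variance in enumerate(explained_variance, start=1):
        running += variance
        if running / total >= threshold:
            return k
    return len(explained_variance)

# Traces of different lengths become fixed-length vectors (made-up values).
fixed = [resample(t, 4) for t in ([0.2, 0.8], [0.9, 0.7, 0.5, 0.3, 0.1, 0.1, 0.1])]
print([len(v) for v in fixed])  # [4, 4]

# Hypothetical per-component variances, sorted in decreasing order as PCA returns them.
n_components = components_for_variance([5.0, 3.0, 1.5, 0.5])
print(n_components)  # 3
```

Fixed-length vectors are what allow a single PCA, or a fixed-input MLP, to be fitted across traces of different lengths; how the choice of resampling versus padding affects the reported separation is not something this sketch addresses.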
Circularity Check
No significant circularity detected; derivation is empirically grounded.
The paper reports an empirical observation that low-dimensional representations of token-level confidence trajectories separate correct from incorrect traces on GSM8K, MATH, and MMLU, where correctness is determined by external final-answer verification against ground truth. NeuralConf is presented as a lightweight model trained on these trajectories to produce correctness scores, with improvements shown via comparison to majority voting and other baselines under fixed trace budgets. The link between the Davies-Bouldin index and AUC is a measured statistical correlation on benchmark data rather than a definitional or self-referential reduction. No quoted step equates the claimed geometric separation or NeuralConf performance to its inputs by construction, and the method remains falsifiable against external labels without requiring the target result as an assumption.
Axiom & Free-Parameter Ledger
free parameters (1)
- embedding dimensionality
axioms (1)
- domain assumption: Token-level confidence values are produced by the model and are available as a sequence during generation.
invented entities (1)
- confidence geometry: no independent evidence