
arxiv: 2605.16824 · v1 · pith:YWL5L3BMnew · submitted 2026-05-16 · 💻 cs.LG · cs.CL

Confidence Geometry Reveals Trace-Level Correctness in Large Language Model Reasoning

Pith reviewed 2026-05-19 20:50 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords confidence trajectories · LLM reasoning · correctness detection · token-level confidence · geometric separation · NeuralConf · trace-level correctness

The pith

Token-level confidence trajectories in LLMs form low-dimensional geometries that separate correct from incorrect reasoning traces without using question or text content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large language models produce token-level confidence values whose trajectories encode information about whether the final answer is correct. These trajectories can be embedded into low-dimensional spaces where correct and incorrect traces cluster separately, even when no information about the question or the reasoning text is available. The strength of this separation, as measured by clustering metrics, predicts how well one can discriminate correct answers. The authors introduce a simple model called NeuralConf that uses these trajectories to score answers and improve aggregation methods over simple voting.

Core claim

Large language models generate reasoning text along with token-level confidence trajectories that record uncertainty evolution. These trajectories possess a content-agnostic confidence geometry linked to the correctness of the final answer. Low-dimensional representations of the trajectories separate correct and incorrect traces across GSM8K, MATH, and MMLU benchmarks. The separation strength correlates with discrimination performance, and correctness signals concentrate in the tail of the reasoning process. A lightweight estimator NeuralConf learns from these trajectories to enhance answer selection.
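
As an illustration of the answer-selection step, the sketch below contrasts plain majority voting with score-weighted voting over several sampled traces. It is a minimal example under stated assumptions: the function names, the answers, and the per-trace scores are all hypothetical, and the scores simply stand in for whatever a trace-level estimator such as NeuralConf would output.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Return the most frequent final answer among the sampled traces."""
    return Counter(answers).most_common(1)[0][0]

def weighted_vote(answers, scores):
    """Return the answer whose traces have the largest total score."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)

# Hypothetical data: five traces for one question, with made-up scores.
answers = ["42", "41", "41", "42", "41"]
scores = [0.9, 0.2, 0.3, 0.8, 0.1]

print(majority_vote(answers))          # 41 (three of five traces)
print(weighted_vote(answers, scores))  # 42 (total score 1.7 versus 0.6)
```

The two rules disagree here because the minority answer comes from the traces with the highest scores, which is the situation in which weighting can help under a fixed trace budget.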

What carries the argument

The confidence geometry formed by low-dimensional representations of token-level confidence trajectories, which separates correct from incorrect reasoning traces.

If this is right

  • Stronger clustering of correct and incorrect traces, as measured by the Davies-Bouldin index, corresponds to higher correctness-discrimination AUC.
  • Correctness-related information is enriched in the tail of the reasoning trace.
  • NeuralConf-derived scores improve confidence-weighted answer aggregation over majority voting and other baselines under a fixed trace budget.
  • LLMs expose trace-intrinsic statistical signals of correctness through their own confidence dynamics.
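
To make the first bullet concrete, here is a minimal plain-Python sketch of the two quantities it relates. For exactly two clusters, the Davies-Bouldin index reduces to the sum of the two within-cluster scatters divided by the distance between the centroids, so lower values mean tighter, better-separated clusters. The AUC is computed as the fraction of (correct, incorrect) pairs in which the correct trace receives the higher score, with ties counted as half. The 2-D points and scores are made up for illustration; they are not data from the paper, and the paper's embedding is not reproduced here.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def scatter(points, center):
    """Mean Euclidean distance from each point to the cluster centroid."""
    return sum(math.dist(p, center) for p in points) / len(points)

def davies_bouldin_two(cluster_a, cluster_b):
    """Davies-Bouldin index for exactly two clusters (lower is better)."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    return (scatter(cluster_a, ca) + scatter(cluster_b, cb)) / math.dist(ca, cb)

def auc(pos_scores, neg_scores):
    """Probability that a positive outranks a negative; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical 2-D embeddings of correct and incorrect traces.
correct = [(0.0, 0.0), (0.0, 2.0)]
incorrect_near = [(4.0, 0.0), (4.0, 2.0)]
incorrect_far = [(10.0, 0.0), (10.0, 2.0)]

print(davies_bouldin_two(correct, incorrect_near))  # 0.5
print(davies_bouldin_two(correct, incorrect_far))   # 0.2

# Hypothetical trace-level scores.
print(round(auc([0.9, 0.8, 0.4], [0.5, 0.3]), 3))   # 0.833
```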

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could allow verification of reasoning quality using only the model's own output probabilities.
  • The geometric separation might generalize to other types of sequential predictions in AI systems.
  • Monitoring confidence trajectories could help detect errors in long-form generation tasks.

Load-bearing premise

The observed low-dimensional separation in confidence trajectories is caused by trace-level correctness rather than by other properties such as trace length or token distributions.

What would settle it

If the separation between correct and incorrect traces disappears when traces are matched for length and similar token statistics, or if it does not hold on additional benchmarks outside GSM8K, MATH, and MMLU, the central claim would be falsified.
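
One way such a length-matched comparison could be set up is sketched below. This is an assumption about how the control might be built, not the paper's procedure: traces are grouped into length bins, and within each bin the same number of correct and incorrect traces is kept. The function name, the `bin_width` parameter, and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def length_matched_subsample(traces, bin_width=50, seed=0):
    """Return indices of a subsample balanced by correctness within length bins.

    traces: list of (length_in_tokens, is_correct) pairs.
    """
    rng = random.Random(seed)
    bins = defaultdict(lambda: {True: [], False: []})
    for idx, (length, is_correct) in enumerate(traces):
        bins[length // bin_width][bool(is_correct)].append(idx)

    kept = []
    for groups in bins.values():
        # Keep as many traces per class as the smaller class has in this bin.
        n = min(len(groups[True]), len(groups[False]))
        kept += rng.sample(groups[True], n) + rng.sample(groups[False], n)
    return sorted(kept)

# Hypothetical traces: (length, is_correct).
traces = [(10, True), (20, True), (30, False),
          (120, False), (130, False), (140, True),
          (300, True)]

kept = length_matched_subsample(traces, bin_width=50, seed=0)
print(len(kept))  # 4: one correct and one incorrect trace from each mixed bin
```

If the separation were driven by length alone, clustering metrics computed on the kept subset would be expected to degrade sharply relative to the full set.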

Figures

Figures reproduced from arXiv: 2605.16824 by Ding Liu, Shi-Ju Ran, Shuo Liu.

Figure 1. Overview of the confidence-only readout protocol. A frozen backbone LLM samples multiple reasoning traces for each question. Each trace produces ...
Figure 2. Representation-level evidence that confidence trajectories contain correctness-related structure. UMAP [20] visualizations of raw confidence trajectories ...
Figure 3. Dependence of representation quality and trace-level discrimination on the maximum input length. Top, DBI of NeuralConf embeddings across input ...
Figure 4. Positional organization of correctness-related structure along the confidence trajectory. DBI and trace-level AUC are shown as functions of window ...
Figure 5. Scale-dependent recoverability of correctness-related signals on ...
Figure 6. Trace-level AUC of Bottom-10Conf as a function of grouping length on GSM8K, MATH and MMLU. Bottom-10Conf is computed from the full ...
Figure 7. Distributional comparison of trace-level scores for incorrect and correct traces on GSM8K, MATH and MMLU. Rows correspond to datasets and ...
Figure 8. Trace-level accuracy as a function of the maximum input length on GSM8K, MATH and MMLU, shown together with DBI for reference. Results are ...
Figure 9. Matched comparison between head-aligned and tail-aligned inputs across maximum input lengths on GSM8K, MATH and MMLU. For each ...

Original abstract

Large language models (LLMs) generate not only reasoning text, but also token-level confidence trajectories that record how uncertainty evolves during inference. Whether these trajectories are relevant to reasoning correctness remains unclear. Here we show that confidence trajectories encode a content-agnostic confidence geometry associated with trace-level final-answer correctness. Using only token-level confidence values, without access to the input question, reasoning text, hidden states, or external verifiers, we find that low-dimensional representations of confidence trajectories separate correct from incorrect reasoning traces. Across GSM8K, MATH, and MMLU, this geometric separation is quantitatively linked to downstream predictability: stronger clustering of correct and incorrect traces, measured by the Davies--Bouldin index, consistently corresponds to higher correctness-discrimination AUC. We further show that correctness-related information is enriched in the tail of reasoning, suggesting that late-stage confidence dynamics carry key correctness signals. We propose NeuralConf, a lightweight estimator that learns from confidence trajectories for correctness evaluation. Under a fixed trace budget, NeuralConf-derived scores improve confidence-weighted answer aggregation over majority voting, tail confidence, and other static baselines. These results reveal that LLMs expose trace-intrinsic statistical signals of correctness through their own confidence dynamics, offering a route to improve inference using information already present within generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that token-level confidence trajectories generated by LLMs during reasoning encode a content-agnostic low-dimensional 'confidence geometry' that separates correct from incorrect final-answer traces. Across GSM8K, MATH, and MMLU, the Davies-Bouldin index of this clustering correlates with AUC for correctness discrimination; correctness signals are enriched in the tail of trajectories; and NeuralConf, a lightweight estimator trained on these trajectories, improves confidence-weighted answer aggregation over majority voting and static baselines under a fixed trace budget, using only the scalar confidence sequence without access to questions, text, or hidden states.

Significance. If the separation survives controls for generation artifacts, the result would be significant: it identifies an intrinsic, content-free statistical signal of reasoning correctness already present in standard LLM generation, enabling new inference-time methods that improve upon existing aggregation strategies without external verifiers. The reported correlation between clustering quality and downstream AUC, together with the tail-enrichment observation, would constitute a falsifiable empirical link between trajectory geometry and correctness.

major comments (3)
  1. [Abstract] The reported quantitative link between Davies-Bouldin index and AUC for correctness discrimination is presented without any description of the embedding procedure that maps raw confidence sequences to low-dimensional representations, the criterion used to select dimensionality, or statistical controls (e.g., length regression or matched subsampling).
  2. [Abstract] The central claim that observed separation reflects trace-level correctness rather than correlated generation properties is load-bearing yet unsupported by the described experiments; incorrect traces systematically differ in length, cumulative entropy, and stopping dynamics, any of which can produce separable patterns in a raw scalar sequence, and no length-matched ablation, token-distribution control, or regression on length is mentioned.
  3. [Abstract] The claim that NeuralConf improves aggregation 'under a fixed trace budget' is stated without specifying the training objective, input representation, or whether the model receives only the confidence trajectory (as asserted) versus additional features; this leaves unclear whether the performance gain is attributable to the proposed geometry or to other factors.
minor comments (2)
  1. [Abstract] The abstract introduces 'confidence geometry' and 'NeuralConf' without a concise formal definition or pointer to the section that defines them.
  2. The manuscript should clarify the exact set of LLMs and generation hyperparameters used to produce the confidence trajectories, as these choices affect the generality of the reported separation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful review and for recognizing the potential significance of identifying content-agnostic signals in LLM confidence trajectories. We address each major comment below with clarifications drawn from the manuscript and indicate where revisions will strengthen the presentation.

  1. Referee: [Abstract] The reported quantitative link between Davies-Bouldin index and AUC for correctness discrimination is presented without any description of the embedding procedure that maps raw confidence sequences to low-dimensional representations, the criterion used to select dimensionality, or statistical controls (e.g., length regression or matched subsampling).

    Authors: We agree that the abstract's brevity omitted these details. The full manuscript (Section 3.2) specifies that raw confidence sequences are embedded via PCA after length normalization and padding, with dimensionality selected by retaining components that explain at least 90% of variance; statistical controls include both length regression and matched subsampling by trace length. In the revision we will add a concise clause to the abstract summarizing the embedding and controls so that the quantitative link is presented with the necessary methodological context.
    Revision: yes

  2. Referee: [Abstract] The central claim that observed separation reflects trace-level correctness rather than correlated generation properties is load-bearing yet unsupported by the described experiments; incorrect traces systematically differ in length, cumulative entropy, and stopping dynamics, any of which can produce separable patterns in a raw scalar sequence, and no length-matched ablation, token-distribution control, or regression on length is mentioned.

    Authors: This concern is well-founded and we acknowledge that generation artifacts must be ruled out. The manuscript already reports length-matched subsampling and linear regression of Davies-Bouldin index on length and entropy (Section 4.1 and Appendix B), showing that geometric separation persists after these controls. However, these controls were not referenced in the abstract. We will revise the abstract to explicitly state that the reported correlation survives length-matched ablation and regression on length/entropy, thereby directly addressing the possibility that separation arises from generation properties rather than correctness.
    Revision: yes

  3. Referee: [Abstract] The claim that NeuralConf improves aggregation 'under a fixed trace budget' is stated without specifying the training objective, input representation, or whether the model receives only the confidence trajectory (as asserted) versus additional features; this leaves unclear whether the performance gain is attributable to the proposed geometry or to other factors.

    Authors: We accept that the abstract does not fully specify these aspects. NeuralConf is a small MLP trained with binary cross-entropy on correctness labels using only the scalar confidence sequence (padded to fixed length) as input; no text, hidden states, or question features are provided. The fixed-trace-budget experiments compare against majority voting and static baselines under identical generation constraints. In revision we will insert a brief parenthetical in the abstract clarifying that NeuralConf receives solely the confidence trajectory and is trained for binary correctness prediction, making the source of the reported gains explicit.
    Revision: yes
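
The description in the third response (a small MLP trained with binary cross-entropy on a fixed-length confidence sequence) can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the paper's NeuralConf: the class `TinyMLP`, the helper `tail_align`, the left-padding scheme, the layer sizes, the learning rate, and the trajectories are all hypothetical, and the architecture follows only what the simulated rebuttal states. Tail alignment is used because the page's Figure 9 caption mentions tail-aligned inputs; how the authors actually pad is not specified here.

```python
import math
import random

def tail_align(confidences, max_len, pad_value=0.0):
    """Keep the last max_len values and left-pad shorter sequences."""
    tail = list(confidences[-max_len:])
    return [pad_value] * (max_len - len(tail)) + tail

class TinyMLP:
    """One tanh hidden layer and a sigmoid output (illustrative only)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        """Return hidden activations and the predicted P(correct)."""
        h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * v for w, v in zip(self.w2, h)) + self.b2
        return h, 1.0 / (1.0 + math.exp(-z))

    def step(self, x, y, lr):
        """One SGD step on binary cross-entropy; returns the pre-update loss."""
        h, p = self.forward(x)
        p_safe = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        loss = -(y * math.log(p_safe) + (1 - y) * math.log(1.0 - p_safe))
        dz = p - y  # gradient of BCE w.r.t. the output pre-activation
        for j, hj in enumerate(h):
            dh = dz * self.w2[j] * (1.0 - hj * hj)  # backprop through tanh
            self.w2[j] -= lr * dz * hj
            self.b1[j] -= lr * dh
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dh * xi
        self.b2 -= lr * dz
        return loss

# Made-up trajectories with correctness labels (1 = correct, 0 = incorrect).
data = [
    ([0.6, 0.7, 0.8, 0.9, 0.9], 1),
    ([0.5, 0.8, 0.9], 1),
    ([0.7, 0.6, 0.4, 0.3, 0.2, 0.2], 0),
    ([0.8, 0.5, 0.3], 0),
]
inputs = [(tail_align(conf, 4), label) for conf, label in data]

model = TinyMLP(n_in=4, n_hidden=3)
first = sum(model.step(x, y, lr=0.1) for x, y in inputs)
for _ in range(500):
    last = sum(model.step(x, y, lr=0.1) for x, y in inputs)
print(first, last)  # summed loss in the first epoch and in the last epoch
```

The model sees only the scalar sequence, which matches the confidence-only constraint described in the response. Its sigmoid output could then serve as the per-trace weight in a score-weighted vote.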

Circularity Check

0 steps flagged

No significant circularity detected; derivation is empirically grounded.


The paper reports an empirical observation that low-dimensional representations of token-level confidence trajectories separate correct from incorrect traces on GSM8K, MATH, and MMLU, where correctness is determined by external final-answer verification against ground truth. NeuralConf is presented as a lightweight model trained on these trajectories to produce correctness scores, with improvements shown via comparison to majority voting and other baselines under fixed trace budgets. The Davies-Bouldin index to AUC linkage is a measured statistical correlation on benchmark data rather than a definitional or self-referential reduction. No quoted step equates the claimed geometric separation or NeuralConf performance to its inputs by construction, and the method remains falsifiable against external labels without requiring the target result as an assumption.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Because only the abstract is available, the ledger records the minimal assumptions implied by the claims rather than explicit statements from the full text.

free parameters (1)
  • embedding dimensionality
    Low-dimensional representations require choosing a target dimension whose value is not derived from first principles and must be selected to achieve the reported separation.
axioms (1)
  • domain assumption: Token-level confidence values are produced by the model and are available as a sequence during generation.
    The entire analysis rests on the existence and accessibility of these per-token scores.
invented entities (1)
  • confidence geometry (no independent evidence)
    purpose: A low-dimensional structure in the space of confidence trajectories that encodes correctness information.
    The paper introduces this geometric view as the central organizing concept.
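
The embedding dimensionality listed as a free parameter could, for example, be fixed by an explained-variance rule; the simulated rebuttal mentions retaining PCA components that explain at least 90% of variance. The sketch below shows only that selection rule, assuming the per-component variances (eigenvalues) have already been computed. The function name and the example spectrum are hypothetical.

```python
def components_for_variance(eigenvalues, threshold=0.90):
    """Smallest number of leading components whose variance share
    reaches the threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)

# Hypothetical variance spectrum; cumulative shares are 0.60, 0.85, 0.95, ...
spectrum = [6.0, 2.5, 1.0, 0.3, 0.2]
print(components_for_variance(spectrum))  # 3
```

The threshold itself remains a choice, so a rule of this kind moves the free parameter from the dimension to the variance cutoff rather than removing it.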

pith-pipeline@v0.9.0 · 5756 in / 1439 out tokens · 57326 ms · 2026-05-19T20:50:25.533304+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 7 internal anchors

  [1] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems, 2022. [Online]. Available: https://proceedings.neurips.cc/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html

  [2] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” in Advances in Neural Information Processing Systems, 2022. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html

  [3] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. V. Le, and E. H. Chi, “Least-to-most prompting enables complex reasoning in large language models,” in International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=WZH7099tgfM

  [4] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” in International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=1PL1NIMMrw

  [5] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” in Advances in Neural Information Processing Systems, 2023. [Online]. Available: https://openreview.net/forum?id=5Xc1ecxO1h

  [6] C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning,” in International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=4FWAwZtd2n

  [7] B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V. Le, C. Ré, and A. Mirhoseini, “Large language monkeys: Scaling inference compute with repeated sampling,” arXiv preprint arXiv:2407.21787, 2024. [Online]. Available: https://arxiv.org/abs/2407.21787

  [8] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, Ł. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021. [Online]. Available: https://arxiv.org/abs/2110.14168

  [9] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s verify step by step,” in International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=v8L0pN6EOi

  [10] Y. Weng, M. Zhu, F. Xia, B. Li, S. He, S. Liu, B. Sun, K. Liu, and J. Zhao, “Large language models are better reasoners with self-verification,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 2550–2575. [Online]. Available: https://aclanthology.org/2023.findings-emnlp.167/

  [11] Y. Xie, K. Kawaguchi, Y. Zhao, J. X. Zhao, M.-Y. Kan, J. He, and M. Xie, “Self-evaluation guided beam search for reasoning,” in Advances in Neural Information Processing Systems, 2023. [Online]. Available: https://papers.nips.cc/paper_files/paper/2023/hash/81fde95c4dc79188a69ce5b24d63010b-Abstract-Conference.html

  [12] A. Jacovi, Y. Bitton, B. Bohnet, J. Herzig, O. Honovich, M. Tseng, M. Collins, R. Aharoni, and M. Geva, “A chain-of-thought is as strong as its weakest link: A benchmark for verifiers of reasoning chains,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 4615– [Online]. Available: https://aclanthology.org/2024.acl-long.254/

  [13] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, S. Johnston, S. El-Showk, A. Jones, N. Elhage, T. Hume, A. Chen, Y. Bai, S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Jacobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse, C. Olsson, S. Ringer, D. Amodei, T. Brown, J. C..., “Language Models (Mostly) Know What They Know.”

  [14] L. Kuhn, Y. Gal, and S. Farquhar, “Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation,” in International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=VD-AYtP0dve

  [15] M. Xiong, Z. Hu, X. Lu, Y. Li, J. Fu, J. He, and B. Hooi, “Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms,” in International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=gjeQKFxFpZ

  [16] S. Lin, J. Hilton, and O. Evans, “Teaching models to express their uncertainty in words,” arXiv preprint arXiv:2205.14334, 2022. [Online]. Available: https://arxiv.org/abs/2205.14334

  [17] A. Azaria and T. Mitchell, “The internal state of an LLM knows when it’s lying,” in Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics, 2023, pp. 967–976. [Online]. Available: https://aclanthology.org/2023.findings-emnlp.68/

  [18] C. Chen, K. Liu, Z. Chen, Y. Gu, Y. Wu, M. Tao, Z. Fu, and J. Ye, “INSIDE: LLMs’ internal states retain the power of hallucination detection,” in International Conference on Learning Representations, [Online]. Available: https://openreview.net/forum?id=Zj12nzlQbz

  [19] Y. Fu, X. Wang, H. Zhang, Y. Tian, and J. Zhao, “Deep think with confidence,” in International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=8LqHs0KIM7

  [20] L. McInnes, J. Healy, and J. Melville, “Umap: Uniform manifold approximation and projection for dimension reduction,” arXiv preprint arXiv:1802.03426, 2018. [Online]. Available: https://arxiv.org/abs/1802.03426

  [21] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224–227, 1979. [Online]. Available: https://doi.org/10.1109/TPAMI.1979.4766909

  [22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html

  [23] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009. [Online]. Available: https://doi.org/10.1109/TKDE.2008.239

  [24] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Measuring mathematical problem solving with the math dataset,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021. [Online]. Available: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd...

  [25] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=d7KBjmI3GmQ

  [26] W. Yu, Z. Jiang, Y. Dong, and J. Feng, “Reclor: A reading comprehension dataset requiring logical reasoning,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=HJgJtT4tvB

  [27] D. Guo, D. Yang, H. Zhang et al., “Deepseek-r1 incentivizes reasoning in LLMs through reinforcement learning,” Nature, vol. 645, no. 8081, pp. 633–638, 2025. [Online]. Available: https://doi.org/10.1038/s41586-025-09422-z

  [28] Qwen Team, “Qwen2.5 technical report,” arXiv preprint arXiv:2412.15115, 2024. [Online]. Available: https://arxiv.org/abs/2412.15115

  [29] T. Fawcett, “An introduction to roc analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006. [Online]. Available: https://doi.org/10.1016/j.patrec.2005.10.010