Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry
Pith reviewed 2026-05-14 19:11 UTC · model grok-4.3
The pith
Transport geometry on hidden-state trajectories exposes the exact step where LLM reasoning first breaks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that contrastive PCA is the optimal projection for a transport-separation objective between first-error and correct states, and that single-pass first-error localization holds whenever the first error creates a positive transport margin over preceding correct transitions. Correct reasoning follows a stable manifold of locally coherent transitions; the first error appears as a localized increase in transport cost away from this manifold.
What carries the argument
A contrastive PCA lens, built per trace by a label-conditioned teacher, which scores each step with seven geometric transition features measuring transport-cost deviations from the correct manifold.
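The contrastive PCA construction the lens rests on can be sketched in a few lines. This is a minimal numpy illustration of the technique from Abid et al. (2018), not the paper's implementation; the contrast weight `alpha`, the component count `k`, and the toy hidden states are assumptions:

```python
import numpy as np

def contrastive_pca(X_error, X_correct, alpha=1.0, k=2):
    """Top-k directions maximizing variance in error states
    minus alpha times variance in correct states."""
    Xe = X_error - X_error.mean(axis=0)
    Xc = X_correct - X_correct.mean(axis=0)
    cov_e = Xe.T @ Xe / len(Xe)
    cov_c = Xc.T @ Xc / len(Xc)
    # Eigen-decomposition of the contrastive covariance (symmetric matrix).
    vals, vecs = np.linalg.eigh(cov_e - alpha * cov_c)
    order = np.argsort(vals)[::-1]      # largest eigenvalues first
    return vecs[:, order[:k]]           # (d, k) projection matrix

# Toy hidden states: correct steps cluster tightly, error steps spread out.
rng = np.random.default_rng(0)
correct = rng.normal(0.0, 0.1, size=(200, 8))
error = correct[:50] + rng.normal(0.0, 1.0, size=(50, 8))
W = contrastive_pca(error, correct)
print(W.shape)  # (8, 2)
```

Projecting a trace's hidden states through `W` then emphasizes exactly the directions along which error states differ from correct ones.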
If this is right
- Single-pass first-error localization succeeds whenever a positive transport margin is present.
- The teacher model outperforms entropy, probing, and attention baselines across ProcessBench, PRM800K, HaluEval, and TruthfulQA.
- The teacher transfers stably across language models and datasets while the distilled student collapses under shift.
- Detection requires only a single forward pass, with no need for multiple sampled completions.
Where Pith is reading between the lines
- Integration into decoding loops could enable real-time correction at the exact error step.
- If the manifold structure generalizes, the same geometry could flag first errors in non-language sequential tasks such as planning or symbolic execution.
- Robustness under distribution shift for the student model would require distillation objectives that explicitly preserve the transport margin.
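The last bullet's idea, a distillation objective that explicitly preserves the transport margin, can be made concrete. Below is a hedged sketch: a standard imitation term plus a hinge penalty whenever the student's margin shrinks below the teacher's. The loss form, the hinge, and the weight `lam` are our assumptions, not the paper's objective:

```python
import numpy as np

def transport_margin(step_costs, first_error_idx):
    """Margin of the first-error transition cost over the max preceding correct cost."""
    return step_costs[first_error_idx] - step_costs[:first_error_idx].max()

def distill_loss(student_costs, teacher_costs, first_error_idx, lam=1.0):
    # Imitation term: match the teacher's per-step transport costs.
    imitation = np.mean((student_costs - teacher_costs) ** 2)
    # Hinge term: penalize any shrinkage of the transport margin.
    gap = (transport_margin(teacher_costs, first_error_idx)
           - transport_margin(student_costs, first_error_idx))
    return imitation + lam * max(gap, 0.0)

teacher = np.array([0.2, 0.3, 0.25, 1.4])  # error at step 3, margin 1.1
student = np.array([0.3, 0.3, 0.3, 0.9])   # margin shrinks to 0.6, so penalized
print(distill_loss(student, teacher, first_error_idx=3))
```

A student trained with only the imitation term can match scores on average while flattening exactly the margin the localization theorem needs, which is the failure mode the review anticipates.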
Load-bearing premise
The first error creates a positive transport margin over preceding correct transitions, and a stable manifold of locally coherent transitions exists for correct reasoning.
What would settle it
An example where the first error produces no positive transport margin over prior correct steps would falsify the single-pass localization guarantee.
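The falsification test is mechanical once per-step transport costs are in hand. A minimal sketch, where `step_costs` holding each transition's transport cost and the annotated `first_error_idx` are hypothetical names for quantities the paper would supply:

```python
import numpy as np

def margin_is_positive(step_costs, first_error_idx):
    """The theorem's precondition: the first error's transport cost must
    exceed every preceding correct transition's cost."""
    if first_error_idx == 0:
        return True  # no preceding transitions to dominate
    return step_costs[first_error_idx] > step_costs[:first_error_idx].max()

# A counterexample trace: the first error (step 2) is *cheaper* than a
# correct-but-expensive earlier transition, so the guarantee does not apply.
costs = np.array([0.2, 0.9, 0.5, 0.4])
print(margin_is_positive(costs, first_error_idx=2))  # False
```

Running this check over annotated benchmark traces would directly measure how often the precondition, and hence the guarantee, actually holds.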
Original abstract
Large language models hallucinate during multi-step reasoning, but most existing detectors operate at the trace level: they assign one confidence score to a full output, fail to localize the first error, and often require multiple sampled completions. We frame hallucination instead as a property of the hidden-state trajectory produced during a single forward pass. Correct reasoning moves through a stable manifold of locally coherent transitions; a first error appears as a localized excursion in transport cost away from this manifold. We operationalize this view with a label-conditioned teacher that builds a trace-specific contrastive PCA lens and scores each step with seven geometric transition features, and a deployable BiLSTM student distilled from the teacher that operates on raw hidden states without inference-time labels. We prove that contrastive PCA is the optimal projection for a transport-separation objective between first error and correct states, and that single-pass first error localization holds whenever the first error creates a positive transport margin over preceding correct transitions. On ProcessBench, PRM800K, HaluEval, and TruthfulQA, both models outperform entropy-based, probing-based, and attention-based baselines in-domain; the teacher transfers stably across language models and datasets, while the student collapses under shift, a gap our distillation theory predicts. These results recast step-level hallucination detection as a problem of trajectory dynamics and identify the central obstacle to deployment: preserving the contrastive transport margin under distribution shift.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper frames step-level hallucination detection in LLMs as analysis of hidden-state trajectories during a single forward pass. Correct reasoning follows a stable manifold of coherent transitions, while the first error appears as an excursion measurable via transport cost. It introduces a label-conditioned teacher that constructs a trace-specific contrastive PCA projection and extracts seven geometric features, plus a BiLSTM student distilled for label-free deployment. The central claims are a proof that contrastive PCA is optimal for a transport-separation objective and a localization theorem that holds whenever the first error produces a positive transport margin over prior correct steps. Experiments on ProcessBench, PRM800K, HaluEval, and TruthfulQA report outperformance versus entropy, probing, and attention baselines, with the teacher transferring across models/datasets while the student does not.
Significance. If the optimality proof and margin-based localization theorem are rigorously established and the positive-margin condition is empirically validated on the target benchmarks, the work would supply a principled geometric account of where reasoning breaks and enable single-pass localization without multiple samples. The teacher-student distillation gap and the explicit identification of distribution-shift fragility as the deployment obstacle are useful contributions. The framing recasts hallucination detection as trajectory dynamics rather than post-hoc scoring.
major comments (3)
- [Abstract / theoretical results] Abstract and theoretical results: The localization guarantee is stated to hold 'whenever the first error creates a positive transport margin over preceding correct transitions,' yet no margin histogram, failure-case breakdown, or independent diagnostic is supplied showing that this margin is positive for the majority of first errors on ProcessBench or PRM800K. If the margin is zero or negative for a non-negligible fraction of traces, the theorem does not apply and the reported gains reduce to a standard probing baseline.
- [Theoretical results] § on optimality proof: The claim that contrastive PCA is the optimal projection for the transport-separation objective between first-error and correct states requires the full derivation to be checked for hidden assumptions; the abstract alone does not clarify whether the margin quantity is independently grounded or defined circularly by the same projection used for detection.
- [Experiments] Experimental section: Outperformance is asserted on four benchmarks without reported error bars, statistical significance tests, or ablation on the seven geometric features; the soundness of the central empirical claim cannot be assessed from the provided details.
minor comments (2)
- [Method] Clarify whether the seven geometric transition features remain well-defined and computable when the student model operates without any label-conditioned information at inference time.
- [Introduction] Add a short paragraph contrasting the transport-margin condition with existing notions of reasoning coherence or entropy spikes to help readers situate the geometric contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that directly strengthen the empirical validation and presentation of the theoretical results.
Point-by-point responses
Referee: [Abstract / theoretical results] The localization guarantee is stated to hold 'whenever the first error creates a positive transport margin over preceding correct transitions,' yet no margin histogram, failure-case breakdown, or independent diagnostic is supplied showing that this margin is positive for the majority of first errors on ProcessBench or PRM800K. If the margin is zero or negative for a non-negligible fraction of traces, the theorem does not apply and the reported gains reduce to a standard probing baseline.
Authors: We agree that direct empirical validation of the positive-margin condition is required to confirm the theorem's applicability on the reported benchmarks. The localization theorem is explicitly conditional, and while the observed gains over baselines are consistent with a generally positive margin, we did not provide margin distributions or failure breakdowns in the original submission. In the revised manuscript we will add histograms of transport margins for first-error versus correct steps on ProcessBench and PRM800K, include a quantitative breakdown of traces with non-positive margins, and report their frequency together with the resulting impact on detection accuracy. revision: yes
Referee: [Theoretical results] The claim that contrastive PCA is the optimal projection for the transport-separation objective between first-error and correct states requires the full derivation to be checked for hidden assumptions; the abstract alone does not clarify whether the margin quantity is independently grounded or defined circularly by the same projection used for detection.
Authors: The optimality proof appears in the dedicated theoretical section and derives contrastive PCA as the projection that maximizes the transport-separation objective. The margin itself is defined via the transport cost computed in the original hidden-state space prior to any projection; the contrastive PCA step is then shown to preserve this separation. To eliminate any ambiguity about hidden assumptions or potential circularity, we will expand the proof section with a fully explicit step-by-step derivation, enumerate all modeling assumptions (including manifold coherence and the positive-margin precondition), and restate the independence of the margin definition from the learned projection. revision: partial
Referee: [Experiments] Outperformance is asserted on four benchmarks without reported error bars, statistical significance tests, or ablation on the seven geometric features; the soundness of the central empirical claim cannot be assessed from the provided details.
Authors: We accept that the experimental reporting lacks the statistical detail needed for full assessment. The original results present only point estimates. In the revision we will add error bars (standard deviation across random seeds and data splits), perform statistical significance tests (paired t-tests and Wilcoxon signed-rank tests) against all baselines, and include a feature ablation study that isolates the contribution of each of the seven geometric features both individually and in combination. revision: yes
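The significance testing the authors commit to can be approximated even without a stats package. As one hedged alternative to the paired t-tests and Wilcoxon tests they name, here is a paired sign-flip permutation test on per-trace score differences; the toy score arrays are illustrative:

```python
import numpy as np

def paired_permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test for paired samples:
    p-value for the null that a and b share the same mean."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(a) - np.asarray(b)
    observed = abs(diff.mean())
    # Under the null, each paired difference is symmetric around zero,
    # so random sign flips generate the null distribution.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, len(diff)))
    null = np.abs((signs * diff).mean(axis=1))
    return (1 + np.sum(null >= observed)) / (n_perm + 1)

# Toy per-trace F1 scores: method vs. baseline across 20 traces.
rng = np.random.default_rng(1)
baseline = rng.uniform(0.5, 0.7, size=20)
method = baseline + 0.08 + rng.normal(0.0, 0.02, size=20)
p = paired_permutation_test(method, baseline)
print(p < 0.05)  # True
```

The permutation variant makes no normality assumption, which suits per-trace detection scores that are often skewed or bounded.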
Circularity Check
Localization guarantee reduces to positive margin defined by the contrastive-PCA projection itself
specific steps
- Self-definitional [abstract]:
"We prove that contrastive PCA is the optimal projection for a transport-separation objective between first error and correct states, and that single-pass first error localization holds whenever the first error creates a positive transport margin over preceding correct transitions."
The transport margin is defined via the contrastive-PCA projection of hidden-state trajectories; the localization theorem therefore applies exactly when that projection produces positive separation, making the 'whenever' condition equivalent to the success of the method by construction rather than an independent mathematical guarantee.
full rationale
The paper's central theorem states that single-pass first-error localization holds 'whenever the first error creates a positive transport margin over preceding correct transitions.' This margin is obtained from the same contrastive-PCA projection that the paper proves is optimal for the transport-separation objective. Because the margin is computed directly from the projection whose optimality is asserted, the localization claim holds precisely when the projection succeeds in separating the states, rendering the guarantee tautological rather than independently derived. No external diagnostic or independent grounding for the margin is provided in the abstract or claimed derivation chain.
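The distinction the circularity check turns on can be made concrete: a margin computed in the raw hidden-state space is an independent precondition, while a margin computed in the projected space can be positive precisely because the projection was chosen to separate the states. A hedged numpy sketch with a deterministic toy trajectory standing in for real hidden states:

```python
import numpy as np

def margin(costs, k):
    """Transport margin of step k over all preceding steps."""
    return costs[k] - costs[:k].max()

def step_costs(states):
    """Proxy transport cost: Euclidean length of each transition."""
    return np.linalg.norm(np.diff(states, axis=0), axis=1)

# Three correct transitions of norm 2, then an "error" transition of
# norm 1.5 that is concentrated entirely on axis 0.
steps = np.array([[1.0,  1.0, 1.0,  1.0],
                  [1.0, -1.0, 1.0, -1.0],
                  [-1.0, 1.0, 1.0, -1.0],
                  [1.5,  0.0, 0.0,  0.0]])
states = np.vstack([np.zeros(4), np.cumsum(steps, axis=0)])

raw = margin(step_costs(states), k=3)          # raw-space margin: -0.5
lens = margin(step_costs(states[:, :1]), k=3)  # margin after a 1-D "lens": 0.5
print(raw, lens)
```

Here the raw margin is negative while the projected margin is positive: whether the 'whenever' condition holds depends on which space the margin is defined in, which is exactly the ambiguity the check flags.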
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: correct reasoning produces a stable manifold of locally coherent hidden-state transitions.
- Domain assumption: the first error produces a positive transport margin over preceding correct transitions.
Reference graph
Works this paper leans on
- [1] Abid, A., Zhang, M. J., Bagaria, V. K., and Zou, J. Exploring patterns enriched in a dataset with contrastive principal component analysis. Nature Communications, 9(1):2134, 2018. doi: 10.1038/s41467-018-04608-8
- [2] Amiri Shahbazi, M. and Baheri, A. Geometry-aware uncertainty quantification via conformal prediction on manifolds. arXiv preprint arXiv:2602.16015, 2026. doi: 10.48550/arXiv.2602.16015
- [3] Azaria, A. and Mitchell, T. The internal state of an LLM knows when it's lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 967–976, Singapore, 2023
- [4] Baheri, A. Logic-guided vector fields for constrained generative modeling. arXiv preprint arXiv:2602.02009, 2026. doi: 10.48550/arXiv.2602.02009
- [5] Baheri, A. and Alm, C. O. LLMs-augmented contextual bandit. In NeurIPS 2023 Workshop on Foundation Models for Decision Making, 2023
- [6] Baheri, A. and Amiri Shahbazi, M. Conformal prediction across scales: Finite-sample coverage with hierarchical efficiency. Results in Applied Mathematics, 26:100589, 2025. doi: 10.1016/j.rinam.2025.100589
- [7] Baheri, A. and Wei, P. Multi-fidelity temporal reasoning: A stratified logic for cross-scale system specifications. Logics, 3(2):5, 2025. doi: 10.3390/logics3020005
- [8] Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. In International Conference on Learning Representations, 2023
- [9] Chen, C., Liu, K., Chen, Z., Gu, Y., Wu, Y., Tao, M., Fu, Z., and Ye, J. INSIDE: LLMs' internal states retain the power of hallucination detection. In International Conference on Learning Representations, 2024
- [10] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
- [11] Davis, C. and Kahan, W. M. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970
- [12] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059, 2016
- [13] Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. NIPS Deep Learning and Representation Learning Workshop, 2015
- [14] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., and Liu, T. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2), Article 42, 2025. arXiv preprint arXiv:2311.05232
- [15] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Madotto, A., and Fung, P. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023. doi: 10.1145/3571730
- [16] Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022
- [17] Kuhn, L., Gal, Y., and Farquhar, S. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations, 2023
- [18] Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023
- [19] Manakul, P., Liusie, A., and Gales, M. J. F. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 9004–9017, 2023
- [20] Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, volume 35, pp. 17359–17372, 2022
- [21] Peyré, G. and Cuturi, M. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5–6):355–607, 2019
- [22] Villani, C. Optimal Transport: Old and New, volume 338 of Grundlehren der mathematischen Wissenschaften. Springer, 2009
- [23] Wang, P., Li, L., Shao, Z., Xu, R., Dai, D., Li, Y., Chen, D., Wu, Y., and Sui, Z. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9426–9439, Bangkok, Thailand, 2024
- [24] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023
- [25] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022
- [26] Zheng, C., Zhang, Z., Zhang, B., Lin, R., Lu, K., Yu, B., Liu, D., Zhou, J., and Lin, J. ProcessBench: Identifying process errors in mathematical reasoning. arXiv preprint arXiv:2412.06559, 2024
- [27] Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., and Hendrycks, D. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023