Reading Calibrated Uncertainty from Language Model Trajectories

Alexander Herzog; Aliai Eusebi; Enrico Mariconti; Lorenzo Cavallaro; Marie Vasek; Xiaoyu Liang

arxiv: 2605.22864 · v1 · pith:T4JZAZDQnew · submitted 2026-05-19 · 💻 cs.LG

Pith reviewed 2026-05-25 05:42 UTC · model grok-4.3

classification 💻 cs.LG

keywords: uncertainty quantification, language models, linear probes, geometric features, model trajectories, selective abstention, calibration, internal activations

The pith

Eleven geometric features from language model layer trajectories let a linear probe read uncertainty better than maximum softmax probability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models form representations through sequences of updates across layers, and similar final outputs can arise from paths that differ in how evidence accumulates or reverses. This paper extracts eleven scale-invariant geometric features that describe the cumulative trajectory of per-layer MLP updates and feeds them to a sparse linear probe. The probe improves upon the maximum softmax probability (MSP) when deciding whether to abstain from a prediction, with gains that grow as baseline miscalibration increases, reaching up to 21 points of AURC (area under the risk-coverage curve). The features carry closed-form geometric meanings, so the probe coefficients directly indicate the depths at which the model commits to errors or drifts from its endpoint.

Core claim

Tracing the path of per-layer MLP updates with eleven scale-invariant geometric features and feeding them to a sparse linear probe yields uncertainty estimates that outperform the maximum softmax probability under selective abstention, by up to 21 AURC points. The probe coefficients indicate the depth at which errors take shape.

What carries the argument

Eleven scale-invariant geometric features extracted from the cumulative paths of per-layer MLP updates, used as input to a sparse linear probe for uncertainty estimation.
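To make the idea of trajectory features concrete, here is a minimal sketch in dependency-free Python. It is not the paper's implementation. It covers six of the feature names quoted later on this page, and every definition is an assumption inferred from the name: `n̄` is taken to be the mean update norm across layers, `s_ℓ` the running sum of updates, "direction to final" the cosine between `s_ℓ` and the final state, "update to final" the cosine between `m_ℓ` and the final state, and the denominator of cumulative coherence the path length up to layer `ℓ`. The function name `trajectory_features` is hypothetical.

```python
import math

EPS = 1e-12  # guards against division by zero for all-zero vectors


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b) + EPS)


def trajectory_features(updates):
    """updates: per-layer MLP update vectors m_1..m_L, each a list of floats.

    Returns one feature dict per layer. All definitions are assumptions
    inferred from the feature names, not taken from the paper.
    """
    norms = [norm(m) for m in updates]
    mean_norm = sum(norms) / len(updates)   # assumed meaning of n-bar
    total_path = sum(norms)

    # Cumulative states s_l = m_1 + ... + m_l
    states, running = [], [0.0] * len(updates[0])
    for m in updates:
        running = [r + x for r, x in zip(running, m)]
        states.append(running)
    final = states[-1]

    feats, path_so_far = [], 0.0
    for l, m in enumerate(updates):
        path_so_far += norms[l]
        feats.append({
            "relative_update_magnitude": norms[l] / (mean_norm + EPS),
            "cumulative_path_fraction": path_so_far / (total_path + EPS),
            # No previous update exists at the first layer, so use 0.0 there.
            "consecutive_cosine": cosine(m, updates[l - 1]) if l > 0 else 0.0,
            "direction_to_final": cosine(states[l], final),
            "update_to_final": cosine(m, final),
            "cumulative_coherence": norm(states[l]) / (path_so_far + EPS),
        })
    return feats


# Toy trajectories: one that moves straight, one that doubles back.
straight = trajectory_features([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
reversal = trajectory_features([[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]])
```

Because every quantity is a ratio of norms or a cosine, multiplying all updates by the same positive constant leaves the features essentially unchanged, which is the sense of "scale-invariant" assumed here. In the toy data, the doubling-back trajectory has a consecutive cosine near -1 and a cumulative coherence near 0 at its second layer, while the straight trajectory keeps coherence near 1.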

If this is right

The probe coefficients identify specific layers where the model commits prematurely or produces contradictions.
Improvements are largest precisely when the maximum softmax probability is most miscalibrated.
The same trajectory features apply across different models and structured output formats.
Uncertainty quantification can shift from endpoint probabilities to the full internal path.
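The claim that gains track baseline miscalibration is a rank-correlation statement, and a figure excerpt on this page reports it as a Spearman ρ. The sketch below shows how such a coefficient could be computed from per-configuration numbers. The helper names are hypothetical, and the lists at the bottom are made-up placeholders, not results from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Placeholder values for illustration only (NOT from the paper):
baseline_aurc = [0.10, 0.25, 0.18, 0.40]   # hypothetical MSP AURC per configuration
probe_gain = [0.01, 0.05, 0.06, 0.15]      # hypothetical AURC reduction per configuration
rho = spearman(baseline_aurc, probe_gain)
```

A value of `rho` close to 1 would mean that configurations with worse baseline AURC tend to see larger gains. Spearman is a natural choice here because it only assumes a monotone relationship, not a linear one.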

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Trajectory features could be extracted from attention or other layer types to capture additional signals.
The approach might support interventions that adjust representations at the layers where drift is detected.
If the geometric features prove highly predictive, they could reduce reliance on learned probes in favor of direct calculations.
Similar path analysis could extend to non-language sequence models that use layered updates.

Load-bearing premise

The eleven geometric features from per-layer MLP update paths contain information about prediction correctness beyond what the final maximum softmax probability already captures, and this relationship holds across models and output structures.

What would settle it

If the probe shows no improvement over the maximum softmax probability (MSP), or if gains fail to scale with baseline miscalibration on new models and tasks, the central claim would not hold.
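Checking this requires computing AURC for both confidence scores on the same predictions. Below is a minimal sketch of the standard empirical risk-coverage calculation. The function names are hypothetical, the toy scores are invented for illustration, and the paper's exact evaluation code (including how it scales AURC into "points") is not shown on this page.

```python
import math


def msp(logits):
    """Maximum softmax probability for one prediction."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # shift for numerical stability
    return max(exps) / sum(exps)


def aurc(confidence, correct):
    """Empirical area under the risk-coverage curve (lower is better).

    Predictions are accepted from most to least confident. After accepting
    the k most confident ones, the selective risk is the error rate among
    them. AURC is the mean of that risk over k = 1..n. Ties in confidence
    keep their input order.
    """
    order = sorted(range(len(confidence)), key=lambda i: -confidence[i])
    errors, risks = 0, []
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        risks.append(errors / k)
    return sum(risks) / len(risks)


# Toy comparison with invented scores (not data from the paper).
correct = [1, 1, 0, 1, 0]
msp_scores = [0.9, 0.8, 0.95, 0.7, 0.6]    # confidently wrong on item 3
probe_scores = [0.9, 0.8, 0.2, 0.7, 0.1]   # pushes both errors to the bottom
gain = aurc(msp_scores, correct) - aurc(probe_scores, correct)
```

In the toy data the second score ranks both errors last, so its risk stays at zero for the first three accepted predictions and `gain` is positive. The central claim would fail if this difference were zero or negative on new models and tasks.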

Figures

Figures reproduced from arXiv: 2605.22864 by Alexander Herzog, Aliai Eusebi, Enrico Mariconti, Lorenzo Cavallaro, Marie Vasek, Xiaoyu Liang.

**Figure 1.** Cumulative MLP write-vectors traced layer-wise. The trajectory geometry separates the two populations. Correct trajectories converge to û; errored trajectories curve, double back, and drift.

Adjacent text from the paper: … selecting the setting with lowest validation AURC. The selected hyperparameters are refit on train+validation before evaluation on the held-out test fold. 4. Experimental Setup. Models. We conduct the experiments usi…

**Figure 4.** Two Qwen2.5-14B predictions on MMLU at MSP ≈ 0.97, one correct and one incorrect. Layer-wise z-scores of the eleven trajectory features across normalized depth for the correct prediction (top) and incorrect (bottom).

Adjacent text from the paper: Probe performance. Our probe improves over MSP on 41 of 45 model–dataset pairs, with gains largest where MSP performs worst (Spearman ρ = 0.78). For the five configurations with MSP AURC abov…

**Figure 5.** Probe coefficient composition across models, showing coefficient mass by geometric feature family (color), split into main effects (solid) and MSP×trajectory interactions (hatched).

**Figure 2.** Risk-coverage curves for Qwen2.5-72B across the five evaluation datasets. An ideal curve is monotonically non-decreasing as predictions are rejected in order of decreasing confidence, approaching zero risk at low coverage and the base error rate at full coverage. The MSP baseline is shown in dashed blue and our probe in solid green; the shaded region between them indicates the probe’s gain.

**Figure 3.** Probe and MSP AUROC across confidence bins, aggregated over models. Lines show mean AUROC, bands the interquartile range, and dots per-model values. Green shading indicates bins where the probe outperforms MSP.

**Figure 6.** Median probe coefficients per (feature, depth-bin) cell for Qwen2.5-14B on HaluSum (top) and CosmosQA (bottom). Left: direct effects; right: MSP interactions. Pink increases predicted error probability, green decreases it. Values are normalized by the dataset-specific 95th-percentile magnitude.

**Figure 7.** Aggregate attribution maps for Llama-3.2-3B-Instruct on HaluSum (left) and HaluDial (right), averaged over the 100 MSP-matched probe-correct pairs with the largest error-probability gaps. Maps share a normalized-depth axis and color scale.

Original abstract

The maximum softmax probability (MSP) represents a default approach when evaluating uncertainty quantification for language model generation with structured output. Although cheap, it is often miscalibrated. Methods that probe the model's internal activations feed raw hidden states into opaque classifiers, reading activations as static snapshots and leaving implicit the layer-wise trajectory by which a representation is formed. Yet, similar endpoints can arise from very different paths, and how evidence accumulates, reinforces, or reverses across depth might reveal uncertainty that final probabilities obscure. We extract eleven scale-invariant geometric features, tracing the cumulative path of per-layer MLP updates, and feed them to a sparse linear probe. The probe outperforms MSP under selective abstention, with gains scaling with baseline miscalibration up to 21 AURC points. Because every feature has a closed-form geometric meaning, the probe's coefficients trace how and where along depth errors take shape -- which layers commit prematurely, which contradict the running state, where trajectories drift away from their endpoint.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that eleven scale-invariant geometric features extracted from per-layer MLP update trajectories in language models can be fed to a sparse linear probe to yield better uncertainty estimates than the maximum softmax probability (MSP). The probe improves selective abstention performance (AURC gains up to 21 points, scaling with baseline miscalibration), and the closed-form geometric meaning of the features permits interpretation of probe coefficients to locate where along depth errors form.

Significance. If the reported gains and orthogonality to MSP hold under the stated conditions, the work supplies an interpretable, low-cost-at-inference alternative to opaque activation probes for LLM uncertainty. The emphasis on scale-invariant, geometrically defined features and the explicit tracing of error accumulation across layers constitute a clear methodological contribution that could be tested on additional model families and output structures.

minor comments (2)

Abstract: the acronym AURC is introduced without expansion or reference; a parenthetical definition or citation on first use would improve accessibility for readers outside selective-abstention literature.
Abstract: the phrase 'structured output' is used without enumerating the concrete tasks or output formats (e.g., classification, generation with constrained decoding) on which the eleven features were evaluated.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. The referee summary accurately reflects the manuscript's claims and contributions on scale-invariant geometric features from per-layer MLP trajectories for uncertainty quantification.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper extracts eleven scale-invariant geometric features from per-layer MLP update trajectories and trains a sparse linear probe on them to improve uncertainty quantification over MSP. This is a standard supervised empirical pipeline: features are computed from model internals, the probe is fit on labeled data, and performance is measured via AURC on selective abstention. No equation or claim reduces the reported gains to a quantity defined by the same inputs (no self-definitional loop, no fitted parameter renamed as prediction). The geometric features have explicit closed-form meanings independent of the probe coefficients. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is present in the provided text. The method is self-contained once the probe is trained; orthogonality to MSP is an empirical claim, not a definitional one.
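The supervised pipeline described above (compute features, fit a sparse probe on labeled data, score held-out predictions) can be sketched as follows. This is an illustration, not the paper's training code: it swaps in L1-penalized logistic regression fitted by proximal gradient descent as a stand-in for whatever sparse regularizer and optimizer the authors use, and the function names, hyperparameters, and toy data are all hypothetical. The label convention (1 means the model's prediction was wrong) is also an assumption.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink toward zero by t."""
    return math.copysign(max(abs(w) - t, 0.0), w)


def fit_sparse_probe(X, y, l1=0.05, lr=0.5, steps=500):
    """L1-regularized logistic regression via proximal gradient descent.

    X: list of feature rows; y: 1 if the prediction was an error, else 0
    (assumed convention). Returns (weights, bias). The bias is not penalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        # Gradient step on the logistic loss, then shrinkage for sparsity
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad_w)]
    return w, b


def error_probability(w, b, x):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy data: feature 0 tracks the error label, feature 1 is uninformative.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [1, 1, 0, 0]
weights, bias = fit_sparse_probe(X, y)
```

On this toy set the weight on the uninformative feature stays at zero while the informative one grows, which is the property that lets a reader interpret the surviving coefficients. Raising `l1` far enough drives every weight to zero.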

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that geometric path features encode uncertainty information beyond final probabilities; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: Scale-invariant geometric features extracted from per-layer MLP update trajectories capture uncertainty relevant to correctness that is not present in the final MSP.
This premise is required for the probe to provide gains over the baseline and is stated as the motivation in the abstract.

pith-pipeline@v0.9.0 · 5708 in / 1255 out tokens · 21888 ms · 2026-05-25T05:42:36.410331+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: We extract eleven scale-invariant geometric features, tracing the cumulative path of per-layer MLP updates... Relative update magnitude ∥m_ℓ∥/n̄, Cumulative path fraction, Consecutive cosine, Curvature, Update–state alignment, Direction to final, Update to final, Cumulative coherence ∥s_ℓ∥/∑∥m_k∥

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: The probe outperforms MSP under selective abstention, with gains scaling with baseline miscalibration up to 21 AURC points.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 12 internal anchors

[1] Qwen2.5 Technical Report. ArXiv.
[2] Guo, Chuan; Pleiss, Geoff; Sun, Yu; Weinberger, Kilian Q. Proceedings of the 34th International Conference on Machine Learning, Volume 70, 2017.
[3] The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
[4] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. arXiv preprint arXiv:2401.02954.
[5] Yi: Open Foundation Models by 01.AI. arXiv preprint arXiv:2403.04652.
[6] Benchmarking LLMs via Uncertainty Quantification. Advances in Neural Information Processing Systems.
[7] Measuring Massive Multitask Language Understanding. arXiv preprint arXiv:2009.03300.
[8] SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
[9] On Calibration of Modern Neural Networks. International Conference on Machine Learning, 2017.
[10] Obtaining Well Calibrated Probabilities Using Bayesian Binning. Proceedings of the AAAI Conference on Artificial Intelligence.
[11] Generating with Confidence: Uncertainty Quantification for Black-Box Large Language Models. arXiv preprint arXiv:2305.19187.
[12] Conformal Prediction with Large Language Models for Multi-Choice Question Answering. arXiv preprint arXiv:2305.18404.
[13] Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2.
[14] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. arXiv preprint arXiv:2302.09664.
[15] Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph. Transactions of the Association for Computational Linguistics.
[16] Language Models as Knowledge Bases? Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
[17] Understanding Intermediate Layers Using Linear Classifier Probes. arXiv preprint arXiv:1610.01644.
[18] InternalInspector I2: Robust Confidence Estimation in LLMs through Internal States. Findings of the Association for Computational Linguistics: EMNLP 2024.
[19] Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs. arXiv preprint arXiv:2406.15927.
[20] The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. arXiv preprint arXiv:2310.06824.
[21] The Geometries of Truth Are Orthogonal Across Tasks. arXiv preprint arXiv:2506.08572.
[22] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. Advances in Neural Information Processing Systems.
[23] Active Learning Principles for In-Context Learning with Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023.
[24] A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. arXiv preprint arXiv:1610.02136.
[25] Transformer Feed-Forward Layers Are Key-Value Memories. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
[26] Locating and Editing Factual Associations in GPT. Advances in Neural Information Processing Systems.
[27] Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations. Findings of the Association for Computational Linguistics: EMNLP 2024.
[28] Semantic Volume: Quantifying and Detecting Both External and Internal Uncertainty in LLMs. Proceedings of the AAAI Conference on Artificial Intelligence.
[29] Discovering Latent Knowledge in Language Models Without Supervision. arXiv preprint arXiv:2212.03827.
[30] Revisiting the Evaluation of Uncertainty Estimation and Its Application to Explore Model Complexity-Uncertainty Trade-Off. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
[31] Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2005.
[32] Bias-Reduced Uncertainty Estimation for Deep Neural Classifiers. International Conference on Learning Representations.
[33] The Internal State of an LLM Knows When It’s Lying. Findings of the Association for Computational Linguistics: EMNLP 2023.
[34] Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach. arXiv preprint arXiv:2404.15993.
[35] What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? Advances in Neural Information Processing Systems.
[36] Language Models (Mostly) Know What They Know. arXiv preprint arXiv:2207.05221.
[37] Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature, 2024.
[38] Can Linear Probes Measure LLM Uncertainty? arXiv preprint arXiv:2510.04108.
[39] Geifman, Yonatan; El-Yaniv, Ran. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
[40] On Optimum Recognition Error and Reject Tradeoff. IEEE Transactions on Information Theory, 1970.
[41] Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
[42] HellaSwag: Can a Machine Really Finish Your Sentence? Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
[43] HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
[44] Bias-Reduced Uncertainty Estimation for Deep Neural Classifiers. arXiv preprint arXiv:1805.08206.
