arxiv: 2406.15927 · v1 · pith:J3PD6HKJnew · submitted 2024-06-22 · 💻 cs.CL · cs.AI · cs.LG

Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, Yarin Gal

Pith reviewed 2026-05-18 00:47 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG

keywords semantic entropy, hallucination detection, uncertainty quantification, large language models, probing methods, hidden states

The pith

Semantic entropy probes detect hallucinations in LLMs using hidden states from a single generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces semantic entropy probes to approximate semantic entropy without multiple generations. This approach reduces the computational overhead of uncertainty quantification to nearly zero. It maintains strong performance in detecting hallucinations and generalizes better to new data distributions than prior methods. A sympathetic reader would care because it makes reliable hallucination detection practical for real-world LLM use.

Core claim

Semantic entropy probes (SEPs) are simple classifiers trained to estimate semantic entropy directly from the hidden states of one model generation, retaining high hallucination detection performance and better out-of-distribution generalization than accuracy-based probes.

What carries the argument

Semantic entropy probes, trained on hidden states to recover semantic entropy estimates from single generations.
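To make the idea of a probe concrete, here is a minimal sketch, assuming a logistic-regression probe trained on hidden-state vectors with binary high/low semantic-entropy labels. The function names (`make_toy_data`, `train_probe`, `probe_score`), the three-dimensional toy "hidden states", and the hyperparameters are illustrative assumptions, not the authors' implementation; real inputs would be activations taken from a chosen layer and token position of the LLM.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_data(n_per_class=50, seed=0):
    # Hypothetical stand-in for hidden states: the first coordinate
    # shifts with the (binarized) semantic-entropy label, the rest is noise.
    rng = random.Random(seed)
    states, labels = [], []
    for label in (0, 1):
        center = 2.0 if label == 1 else -2.0
        for _ in range(n_per_class):
            states.append([center + rng.gauss(0, 0.5), rng.gauss(0, 1), rng.gauss(0, 1)])
            labels.append(label)
    return states, labels


def train_probe(hidden_states, high_entropy_labels, lr=0.1, epochs=200):
    # Full-batch gradient descent on the logistic loss.
    dim = len(hidden_states[0])
    n = len(hidden_states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(hidden_states, high_entropy_labels):
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - y
            for j in range(dim):
                grad_w[j] += err * h[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_score(w, b, hidden_state):
    # Predicted probability that this single generation has high semantic entropy.
    return sigmoid(sum(wi * hi for wi, hi in zip(w, hidden_state)) + b)
```

At test time only `probe_score` is needed, which is one dot product per generation; that is the sense in which the overhead is close to zero once the probe is trained.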

If this is right

Hallucination detection no longer requires sampling 5-10 generations at test time.
Uncertainty quantification overhead drops to almost zero after probe training.
Probes generalize better to out-of-distribution data compared to direct accuracy prediction methods.
Model hidden states at certain layers and token positions encode semantic entropy information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If hidden states capture semantic entropy this way, similar probes might work for other forms of uncertainty.
This could enable always-on uncertainty monitoring in production LLM systems without extra compute.
Insights from ablations on layers and positions might guide more efficient model designs for interpretability.

Load-bearing premise

Hidden states from a single generation hold sufficient information about semantic entropy for a simple probe to recover it accurately across tasks and models.

What would settle it

Training a probe on single-generation hidden states and finding it does not predict semantic entropy values computed from multiple samples on new tasks.
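One way to run this check, sketched under assumptions: binarize the multi-sample semantic entropy on a held-out task into high/low labels, then measure how well the single-generation probe scores rank them, for example with AUROC. The helper name `auroc` and the choice of AUROC as the metric are assumptions of this sketch.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    scores: probe outputs, higher meaning "more likely high entropy".
    labels: 1 for high semantic entropy, 0 for low (hypothetical binarization).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUROC close to 0.5 on tasks the probe never saw during training would count against the load-bearing premise; values well above 0.5 would support it.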

Original abstract

We propose semantic entropy probes (SEPs), a cheap and reliable method for uncertainty quantification in Large Language Models (LLMs). Hallucinations, which are plausible-sounding but factually incorrect and arbitrary model generations, present a major challenge to the practical adoption of LLMs. Recent work by Farquhar et al. (2024) proposes semantic entropy (SE), which can detect hallucinations by estimating uncertainty in the space of semantic meaning for a set of model generations. However, the 5-to-10-fold increase in computation cost associated with SE computation hinders practical adoption. To address this, we propose SEPs, which directly approximate SE from the hidden states of a single generation. SEPs are simple to train and do not require sampling multiple model generations at test time, reducing the overhead of semantic uncertainty quantification to almost zero. We show that SEPs retain high performance for hallucination detection and generalize better to out-of-distribution data than previous probing methods that directly predict model accuracy. Our results across models and tasks suggest that model hidden states capture SE, and our ablation studies give further insights into the token positions and model layers for which this is the case.
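As a rough illustration of the quantity being approximated, the sketch below computes a discrete entropy over meaning clusters from several sampled generations. It assumes the generations have already been grouped into clusters of equivalent meaning (the clustering step is not shown) and uses empirical cluster frequencies; the function name `semantic_entropy` and the example cluster ids are illustrative, not taken from the paper.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    cluster_ids: one hashable cluster label per sampled generation,
    assumed to come from some external meaning-clustering step.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled generation")
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    # sum p * log(1/p), written with counts to avoid a negative zero result
    return sum((c / n) * math.log(n / c) for c in counts.values())


# All samples agree in meaning: no semantic uncertainty.
agree = semantic_entropy(["paris", "paris", "paris", "paris", "paris"])  # 0.0
# Samples scatter across meanings: higher semantic uncertainty.
scatter = semantic_entropy(["paris", "lyon", "rome", "paris", "berlin"])
```

The cost the abstract refers to comes from having to sample and cluster several generations per query before a function like this can be called at all.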

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SEPs make semantic entropy for hallucination detection cheap by training a probe on single-generation hidden states, but the approximation may still rest on local signals rather than the full meaning distribution.

The main takeaway is that this paper gives a practical route to semantic entropy without the usual sampling overhead. They train a lightweight probe on the hidden states from one forward pass to stand in for the entropy over meaning clusters that Farquhar et al. compute from multiple generations. That drops the cost to near zero at test time while keeping hallucination detection performance close to the original method and improving out-of-distribution behavior compared with probes that just predict accuracy directly. The ablations on which layers and token positions carry the signal are a useful addition and show the authors thought about where the information lives in the model. Credit to them for shipping a concrete, deployable approximation instead of another theoretical tweak.

The soft spot is exactly the one the stress-test note flags. A single generation's hidden state is conditioned on one sampled sequence, so nothing in the architecture forces it to encode the breadth of the semantic posterior. If the probe is mostly picking up surface features like low-probability tokens or atypical embeddings, the hallucination detection numbers could look good while the link to true semantic entropy stays loose. The abstract claims the hidden states capture SE and the OOD gains support that, but without seeing the full controls, error bars, and checks against simpler baselines, it is hard to rule out a spurious correlation. The training data comes from semantic entropy calculations, which adds some fitting but is acceptable for an empirical approximation.

This is for groups already working on uncertainty quantification in LLMs who need something that runs at scale. Readers who know the Farquhar paper will see the direct connection and get the most from the ablations.

It deserves peer review. The idea is concrete, the empirical claims are testable, and the practical payoff is clear enough that referees should weigh in on the faithfulness of the approximation and the strength of the controls.

Referee Report

3 major / 2 minor

Summary. The paper proposes Semantic Entropy Probes (SEPs), simple trained probes that approximate semantic entropy (SE) directly from the hidden states of a single LLM generation. This enables cheap hallucination detection without sampling multiple generations at test time, while claiming to retain high performance and exhibit better out-of-distribution generalization than prior probes that directly predict model accuracy. Results are reported across models and tasks, with ablations on token positions and layers.

Significance. If the empirical results hold under closer scrutiny, the work is significant for making semantic uncertainty quantification practical in deployment settings. Reducing the 5-10x compute overhead of SE to near-zero while preserving hallucination detection performance, and especially the reported OOD gains, would be a useful advance over both sampling-based SE and accuracy-probing baselines. The suggestion that hidden states implicitly capture semantic entropy is a falsifiable claim with potential for follow-on work.

major comments (3)

[Results / Ablations] The central claim that SEPs recover semantic entropy (rather than surface-level uncertainty signals) from a single generation's hidden states is load-bearing but under-supported. The skeptic concern lands: nothing in the architecture guarantees that activations encode the breadth of the semantic distribution over meaning clusters instead of local features of the realized sequence. The manuscript would benefit from an explicit control (e.g., comparing probe performance against a baseline that only uses token probabilities or embedding norms) in the main results or ablations section.
[Experimental Setup] Soundness is limited by missing experimental details. The abstract states retained performance and better OOD generalization, yet the text provides neither error bars across runs, full hyperparameter tables, nor complete ablation results on layer/token choices. Without these, it is difficult to assess whether the reported gains are robust or whether the probe is simply fitting to the SE-derived training labels.
[Methods] The training procedure introduces a mild circularity risk that should be quantified. Because SEPs are supervised on labels derived from multi-sample semantic entropy calculations, it is unclear how much of the test-time performance reflects genuine approximation from hidden states versus leakage of the original SE signal through the training distribution. A leave-one-task-out or cross-model transfer experiment would clarify this.

minor comments (2)

[Methods] Notation for the probe architecture (linear vs. shallow MLP) and the exact layer/token position chosen for probing should be stated more explicitly in the main text rather than deferred to the appendix.
[Abstract] The abstract mentions 'across models and tasks' but does not list the specific models, datasets, or number of runs; adding a compact table or sentence would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity, rigor, and support for our claims.

Referee: The central claim that SEPs recover semantic entropy (rather than surface-level uncertainty signals) from a single generation's hidden states is load-bearing but under-supported. The skeptic concern lands: nothing in the architecture guarantees that activations encode the breadth of the semantic distribution over meaning clusters instead of local features of the realized sequence. The manuscript would benefit from an explicit control (e.g., comparing probe performance against a baseline that only uses token probabilities or embedding norms) in the main results or ablations section.

Authors: We agree that an explicit control experiment would provide stronger evidence that SEPs capture semantic entropy information rather than simpler surface-level signals. In the revised manuscript we have added a baseline probe trained directly on token probabilities and embedding norms extracted from the single generation. Results show that SEPs consistently outperform this baseline on hallucination detection across models and tasks, supporting the claim that hidden states encode semantic-level information beyond local sequence features. We have placed these comparisons in the main results and expanded the discussion of what this implies for the information captured by the probes.
revision: yes
Referee: Soundness is limited by missing experimental details. The abstract states retained performance and better OOD generalization, yet the text provides neither error bars across runs, full hyperparameter tables, nor complete ablation results on layer/token choices. Without these, it is difficult to assess whether the reported gains are robust or whether the probe is simply fitting to the SE-derived training labels.

Authors: We acknowledge that additional experimental details are necessary for assessing robustness. The revised manuscript now includes error bars computed over five independent runs with different random seeds for all main results. A complete hyperparameter table (including probe architecture, learning rate, batch size, and regularization) has been added to the appendix. We have also expanded the ablation section to report performance for every combination of layer and token position, together with statistical comparisons. These additions allow readers to evaluate whether gains are stable and whether the probe is overfitting to the training labels.
revision: yes
Referee: The training procedure introduces a mild circularity risk that should be quantified. Because SEPs are supervised on labels derived from multi-sample semantic entropy calculations, it is unclear how much of the test-time performance reflects genuine approximation from hidden states versus leakage of the original SE signal through the training distribution. A leave-one-task-out or cross-model transfer experiment would clarify this.

Authors: We agree that quantifying potential leakage from the multi-sample SE labels is valuable. In the revision we have added leave-one-task-out experiments in which the probe is trained on all tasks except one and evaluated on the held-out task. Performance remains competitive with the original in-distribution results, indicating that SEPs learn generalizable mappings from hidden states rather than merely memorizing task-specific SE patterns. We have also expanded the existing cross-model transfer results and included a brief analysis of how much performance drops when the training and test distributions differ more substantially.
revision: yes
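The leave-one-task-out protocol described in the last response can be sketched as a simple data split. This is an illustration under assumptions: the function name `leave_one_task_out`, the task names, and the example records are hypothetical, and the probe training and scoring that would run on each split are left out.

```python
def leave_one_task_out(examples_by_task):
    """Yield (held_out_task, train_examples, test_examples) for every task.

    examples_by_task: dict mapping a task name to its list of examples,
    e.g. (hidden_state, semantic_entropy_label) pairs.
    """
    for held_out in sorted(examples_by_task):
        # Train on every task except the held-out one.
        train = [
            example
            for task, examples in examples_by_task.items()
            if task != held_out
            for example in examples
        ]
        # Evaluate only on the task the probe never saw.
        test = list(examples_by_task[held_out])
        yield held_out, train, test


# Hypothetical toy records: strings stand in for (hidden state, label) pairs.
toy = {
    "task_a": ["a1", "a2"],
    "task_b": ["b1"],
    "task_c": ["c1", "c2", "c3"],
}
splits = list(leave_one_task_out(toy))
```

Each split would train one probe on `train` and report its detection metric on `test`; comparing those numbers with in-distribution results is what separates a transferable mapping from task-specific fitting.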

Circularity Check

0 steps flagged

No circularity: empirical probe training on independently computed targets

The paper defines semantic entropy via prior work (Farquhar et al. 2024) and trains a probe to approximate that quantity from single-generation hidden states. This is a standard supervised regression setup whose targets are computed separately from multiple samples; the probe weights are fitted on held-out data and evaluated on hallucination detection metrics. No equation reduces the target to the probe output by construction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review performed on abstract only; therefore free parameters, axioms, and invented entities are inferred at a high level from the described method.

free parameters (1)

probe weights
The semantic entropy probe is trained on hidden-state features, implying fitted parameters whose values are not reported in the abstract.

axioms (1)

domain assumption: Hidden states of a single generation encode semantic entropy information
Central premise required for the probe to recover semantic entropy without multiple samples.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
cs.CL 2026-05 unverdicted novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
Inducing Artificial Uncertainty in Language Models
cs.CL 2026-05 unverdicted novelty 7.0

Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
cs.AI 2026-05 unverdicted novelty 7.0

Transformer hidden states encode facts as attractor basins; hallucinations occur from basin absence and conflicts from basin competition, detected cleanly by geometric margin rather than entropy.
Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
cs.AI 2026-05 unverdicted novelty 7.0

Attractor basins in transformer hidden states unify conflict and hallucination as basin competition or absence, with geometric margin outperforming entropy for detection and a scaling law governing confident hallucina...
Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs
cs.SE 2025-09 unverdicted novelty 7.0

Clotho ranks LLM test inputs by failure likelihood using pre-generation hidden states and GMMs, achieving 0.716 ROC-AUC after labeling 5.4% of inputs on average across eight tasks and three models, with transfer to pr...
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
cs.AI 2026-05 unverdicted novelty 6.0

Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.
Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits
cs.CL 2026-05 unverdicted novelty 6.0

Probabilistic circuits detect LLM hallucinations as residual-stream anomalies with up to 99% AUROC and enable dynamic correction that raises truthfulness scores while cutting unnecessary output corruption.
Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training
cs.AI 2026-05 conditional novelty 6.0

Average token log-probability provides a zero-shot confidence signal for small LLMs that matches supervised baselines in-distribution and outperforms them out-of-distribution, with a new retrieval-conditional variant ...
To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
cs.AI 2026-05 unverdicted novelty 6.0

LLMs often misalign their self-perceived need for tools with true need and utility, but lightweight estimators trained on hidden states can improve tool-calling decisions and task performance across multiple models and tasks.
The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive
cs.CR 2026-04 unverdicted novelty 6.0

LLM token rank-frequency distributions converge to a shared Mandelbrot distribution across models and domains, enabling a microsecond-scale statistical primitive for provenance verification and black-box anomaly triage.
Process Supervision of Confidence Margin for Calibrated LLM Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
Convergent Evolution: How Different Language Models Learn Similar Number Representations
cs.CL 2026-04 unverdicted novelty 6.0

Diverse language models converge on similar periodic number features with a two-tier hierarchy of Fourier sparsity and geometric separability, acquired via language co-occurrences or multi-token arithmetic.
Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 6.0

Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.
Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation
cs.LG 2026-04 unverdicted novelty 6.0

Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.
Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation
cs.CL 2026-04 unverdicted novelty 6.0

SHADE adaptively combines coverage and spectral signals to estimate semantic alphabet size from few LLM samples, yielding better performance than baselines in low-sample regimes for alphabet estimation and QA error detection.
Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
cs.AI 2026-04 unverdicted novelty 6.0

Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.
Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs
cs.CL 2026-03 unverdicted novelty 6.0

Hallucination neurons in LLMs are domain-specific, with cross-domain classifiers dropping from AUROC 0.783 within-domain to 0.563 across domains.
High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models
cs.CV 2025-12 unverdicted novelty 6.0

High-entropy tokens act as concentrated multimodal failure points in VLMs, enabling sparse Entropy-Guided Attacks that achieve 93-95% success and 30-38% harmful rates with cross-model transfer.
Large Lemma Miners: Can LLMs do Induction Proofs for Hardware?
cs.LO 2025-11 conditional novelty 6.0

A neurosymbolic method using two LLM prompting frameworks generates provably correct inductive arguments for 84% of a set of mid-size open-source RTL hardware designs.
GrACE: A Generative Approach to Better Confidence Elicitation and Efficient Test-Time Scaling in Large Language Models
cs.CL 2025-09 unverdicted novelty 6.0

GrACE is a fine-tuned generative method that uses similarity to a special token embedding for real-time calibrated confidence in LLMs and enables efficient confidence-based test-time scaling.
Chain-of-Thought as a Lens: Evaluating Structured Reasoning Alignment between Human Preferences and Large Language Models
cs.AI 2025-11 unverdicted novelty 5.0

The Alignment Score quantifies semantic divergence between model-generated and human-preferred reasoning chains and correlates with accuracy, readability, and coherence.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 20 Pith papers · 13 internal anchors

[1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Abdin, M., Jacobs, S. A., Awan, A. A., Aneja, J., Awadallah, A., Awadalla, H., Bach, N., Bahree, A., Bakhtiari, A., Behl, H., Benhaim, A., Bilenko, M., Bjorck, J., Bubeck, S., Cai, M., Mendes, C. C. T., Chen, W., Chaudhary, V ., Chopra, P., Giorno, A. D., de Rosa, G., Dixon, M., Eldan, R., Iter, D., Garg, A., Goswami, A., Gunasekar, S., Haider, E., Hao, J...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Agrawal, A., Mackey, L., and Kalai, A. T. Do language models know when they’re hallucinating references? In EACL, 2024

work page 2024
[3]

and Bengio, Y

Alain, G. and Bengio, Y . Understanding intermediate layers using linear classifier probes. InICLR, 2017

work page 2017
[4]

and Mitchell, T

Azaria, A. and Mitchell, T. The internal state of an llm knows when it’s lying. In EMNLP, 2023. 10

work page 2023
[5]

Linguistic calibration of language models

Band, N., Li, X., Ma, T., and Hashimoto, T. Linguistic calibration of language models. arXiv:2404.00474, 2024

work page arXiv 2024
[6]

Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 2021

Belinkov, Y . Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 2021

work page 2021
[7]

Eliciting Latent Predictions from Transformers with the Tuned Lens

Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Biderman, S., and Steinhardt, J. Eliciting latent predictions from transformers with the tuned lens. arXiv 2303.08112, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. NeurIPS, 2020

work page 2020
[9]

Discovering latent knowledge in language models without supervision

Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. In ICLR, 2023

work page 2023
[10]

Cao, M., Dong, Y ., and Cheung, J. C. K. Hallucinated but factual! inspecting the factuality of hallucinations in abstractive summarization. In ACL, 2022

work page 2022
[11]

and Mueller, J

Chen, J. and Mueller, J. Quantifying uncertainty in answers from any language model and enhancing their trustworthiness. arXiv 2308.16175, 2023

work page arXiv 2023
[12]

Dola: Decoding by contrasting layers improves factuality in large language models

Chuang, Y .-S., Xie, Y ., Luo, H., Kim, Y ., Glass, J., and He, P. Dola: Decoding by contrasting layers improves factuality in large language models. In ICLR, 2024

work page 2024
[13]

R., Zhang, M

Cole, J. R., Zhang, M. J., Gillick, D., Eisenschlos, J. M., Dhingra, B., and Eisenstein, J. Selectively answering ambiguous questions. EMNLP, 2023

work page 2023
[14]

Towards question-answering as an automatic metric for evaluating the content quality of a summary

Deutsch, D., Bedrax-Weiss, T., and Roth, D. Towards question-answering as an automatic metric for evaluating the content quality of a summary. TACL, 2021

work page 2021
[15]

Chain-of-Verification Reduces Hallucination in Large Language Models

Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., and Weston, J. Chain-of- verification reduces hallucination in large language models. arXiv:2309.11495, 2023

work page internal anchor Pith review arXiv 2023
[16]

Shifting attention to relevance: Towards the uncertainty estimation of large language models

Duan, J., Cheng, H., Wang, S., Wang, C., Zavalny, A., Xu, R., Kailkhura, B., and Xu, K. Shifting attention to relevance: Towards the uncertainty estimation of large language models. arXiv:2307.01379, 2023

work page arXiv 2023
[17]

Feqa: A question answering evaluation framework for faithfulness assessment in abstractive summarization

Durmus, E., He, H., and Diab, M. Feqa: A question answering evaluation framework for faithfulness assessment in abstractive summarization. ACL, 2020

work page 2020
[18]

Dziri, N., Madotto, A., Zaïane, O., and Bose, A. J. Neural path hunter: Reducing hallucination in dialogue systems via path grounding. In EMNLP, 2021

work page 2021
[19]

Halo: Estimation and reduction of hallucinations in open-source weak large language models

Elaraby, M., Lu, M., Dunn, J., Zhang, X., Wang, Y ., and Liu, S. Halo: Estimation and reduction of hallucinations in open-source weak large language models. arXiv:2308.11764, 2023

work page arXiv 2023
[21]

Detecting Hallucinations in Large Language Models Using Semantic Entropy

Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y . Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature, 2024

work page 2024
[22]

R., and Pan, S

Feldman, P., Foulds, J. R., and Pan, S. Trapping llm hallucinations using tagged context prompts. arXiv:2306.06085, 2023

work page arXiv 2023
[23]

Controlled hallucinations: Learning to generate faithfully from noisy data

Filippova, K. Controlled hallucinations: Learning to generate faithfully from noisy data. In EMNLP, 2020

work page 2020
[24]

T., Fan, Y ., Zhao, V

Gao, L., Dai, Z., Pasupat, P., Chen, A., Chaganty, A. T., Fan, Y ., Zhao, V . Y ., Lao, N., Lee, H., Juan, D.-C., et al. Rarr: Researching and revising what language models say, using language models. In ACL, 2022

work page 2022
[25]

Deberta: Decoding-enhanced bert with disentangled attention

He, P., Liu, X., Gao, J., and Chen, W. Deberta: Decoding-enhanced bert with disentangled attention. In ICLR, 2021

work page 2021
[26]

Z., and Andreas, J

Hernandez, E., Li, B. Z., and Andreas, J. Measuring and manipulating knowledge representations in language models. arXiv:2304.00740, 2023

work page arXiv 2023
[27]

J., Madotto, A., and Fung, P

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y ., Ishii, E., Bang, Y . J., Madotto, A., and Fung, P. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023

work page 2023
[28]

Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b.arXiv, 2023. 11

work page 2023
[29]

S., and Zettlemoyer, L

Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. ACL, 2017

work page 2017
[30]

Language Models (Mostly) Know What They Know

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Dodds, Z. H., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know.arXiv:2207.05221, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

Kuhn, L., Gal, Y ., and Farquhar, S. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In ICLR, 2023

work page 2023
[32]

N., Jones, L., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., and Petrov, S

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin, J., Lee, K., Toutanova, K. N., Jones, L., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., and Petrov, S. Natural questions: a benchmark for question answering research. TACL, 2019

work page 2019
[33]

Measuring Faithfulness in Chain-of-Thought Reasoning

Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., Li, D., Durmus, E., Hubinger, E., Kernion, J., et al. Measuring faithfulness in chain-of-thought reasoning. arXiv:2307.13702, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

N., Shoeybi, M., and Catanzaro, B

Lee, N., Ping, W., Xu, P., Patwary, M., Fung, P. N., Shoeybi, M., and Catanzaro, B. Factuality enhanced language models for open-ended text generation. NeurIPS, 2022

work page 2022
[35]

Inference-time intervention: Eliciting truthful answers from a language model

Li, K., Patel, O., Viégas, F., Pfister, H., and Wattenberg, M. Inference-time intervention: Eliciting truthful answers from a language model. NeurIPS, 36, 2024

work page 2024
[36]

K., Ding, B., Joty, S., Poria, S., and Bing, L

Li, X., Zhao, R., Chia, Y . K., Ding, B., Joty, S., Poria, S., and Bing, L. Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources. In ICLR, 2023

work page 2023
[37]

Teaching models to express their uncertainty in words

Lin, S., Hilton, J., and Evans, O. Teaching models to express their uncertainty in words. TMLR, 2023

work page 2023
[38]

Classification and regression trees.Wiley interdisciplinary reviews: data mining and knowledge discovery, 2011

Loh, W.-Y . Classification and regression trees.Wiley interdisciplinary reviews: data mining and knowledge discovery, 2011

work page 2011
[39]

Zero-resource hallucination prevention for large language models

Luo, J., Xiao, C., and Ma, F. Zero-resource hallucination prevention for large language models. arXiv:2309.02654, 2023

work page arXiv 2023
[40]

Simple probes can catch sleeper agents, 2024

MacDiarmid, M., Maxwell, T., Schiefer, N., Mu, J., Kaplan, J., Duvenaud, D., Bowman, S., Tamkin, A., Perez, E., Sharma, M., Denison, C., and Hubinger, E. Simple probes can catch sleeper agents, 2024. URL https://www.anthropic.com/news/probes-catch-sleeper-agents

work page 2024
[41]

and Gales, M

Malinin, A. and Gales, M. Uncertainty estimation in autoregressive structured prediction. ICLR, 2021

work page 2021
[42]

Manakul, P., Liusie, A., and Gales, M. J. Mqag: Multiple-choice question answering and generation for assessing information consistency in summarization. IJCNLP-AACL, 2023

work page 2023
[43]

Manakul, P., Liusie, A., and Gales, M. J. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023
[44]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Marks, S. and Tegmark, M. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. arXiv:2310.06824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

On faithfulness and factuality in abstractive summarization

Maynez, J., Narayan, S., Bohnet, B., and McDonald, R. On faithfulness and factuality in abstractive summarization. In ACL, 2020

work page 2020
[46]

Introducing meta llama 3: The most capable openly available llm to date

Meta. Introducing meta llama 3: The most capable openly available llm to date, 2024. URL https://ai.meta.com/blog/meta-llama-3/. [Online; accessed June 16 2024]

work page 2024
[47]

Reducing conversational agents’ overconfidence through linguistic calibration

Mielke, S. J., Szlam, A., Boureau, Y.-L., and Dinan, E. Reducing conversational agents’ overconfidence through linguistic calibration. TACL, 2022

work page 2022
[48]

Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W.-t., Koh, P. W., Iyyer, M., Zettlemoyer, L., and Hajishirzi, H. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. EMNLP, 2023

work page 2023
[49]

Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation

Mündler, N., He, J., Jenko, S., and Vechev, M. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. arXiv:2305.15852, 2023

work page arXiv 2023
[50]

Correcting length bias in neural machine translation

Murray, K. and Chiang, D. Correcting length bias in neural machine translation. WMT, 2018.

work page 2018
[51]

Nan, F., Santos, C. N. d., Zhu, H., Ng, P., McKeown, K., Nallapati, R., Zhang, D., Wang, Z., Arnold, A. O., and Xiang, B. Improving factual consistency of abstractive summarization via question answering. ACL-IJCNLP, 2021

work page 2021
[52]

Fact finding: Attempting to reverse-engineer factual recall on the neuron level

Nanda, N., Rajamanoharan, S., Kramar, J., and Shah, R. Fact finding: Attempting to reverse-engineer factual recall on the neuron level, Dec 2023. URL https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall

work page 2023
[53]

Trustworthy journalism through AI

Opdahl, A. L., Tessem, B., Dang-Nguyen, D.-T., Motta, E., Setty, V., Throndsen, E., Tverberg, A., and Trattner, C. Trustworthy journalism through AI. Data Knowl. Eng., 2023

work page 2023
[54]

GPT-4 technical report

OpenAI. GPT-4 technical report, 2023

work page 2023
[55]

Future lens: Anticipating subsequent tokens from a single hidden state

Pal, K., Sun, J., Yuan, A., Wallace, B., and Bau, D. Future lens: Anticipating subsequent tokens from a single hidden state. In CoNLL, 2023

work page 2023
[56]

Scikit-learn: Machine learning in Python

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in Python. JMLR, 12, 2011

work page 2011
[57]

Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback

Peng, B., Galley, M., He, P., Cheng, H., Xie, Y., Hu, Y., Huang, Q., Liden, L., Yu, Z., Chen, W., et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv:2302.12813, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[58]

Language models as knowledge bases?

Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., and Riedel, S. Language models as knowledge bases? EMNLP, 2019

work page 2019
[59]

Know what you don’t know: Unanswerable questions for SQuAD

Rajpurkar, P., Jia, R., and Liang, P. Know what you don’t know: Unanswerable questions for SQuAD. ACL, 2018

work page 2018
[60]

A Survey of Hallucination in Large Foundation Models

Rawte, V., Sheth, A., and Das, A. A survey of hallucination in large foundation models. arXiv:2309.05922, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. M. Steering llama 2 via contrastive activation addition. arXiv:2312.06681, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[62]

How much knowledge can you pack into the parameters of a language model?

Roberts, A., Raffel, C., and Shazeer, N. How much knowledge can you pack into the parameters of a language model? EMNLP, 2020

work page 2020
[63]

ChatGPT and other large language models are double-edged swords

Shen, Y., Heacock, L., Elias, J., Hentel, K. D., Reig, B., Shih, G., and Moy, L. ChatGPT and other large language models are double-edged swords. Radiology, 2023

work page 2023
[64]

Shi, W., Han, X., Lewis, M., Tsvetkov, Y., Zettlemoyer, L., and Yih, S. W.-t. Trusting your evidence: Hallucinate less with context-aware decoding. NAACL, 2023

work page 2023
[65]

Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., Payne, P., Seneviratne, M., Gamble, P., Kelly, C., Scharli, N., Chowdhery, A., Mansfield, P., y Arcas, B. A., Webster, D., Corrado, G. S., Matias, Y., Chou, K., Gottweis, J., Tomasev, N., Liu, Y., Rajkomar, A., Barral, J., Semturs, ...

work page 2023
[66]

Read before generate! faithful long form question answering with machine reading

Su, D., Li, X., Zhang, J., Shang, L., Jiang, X., Liu, Q., and Fung, P. Read before generate! faithful long form question answering with machine reading. ACL, 2022

work page 2022
[67]

Subramani, N., Suresh, N., and Peters, M. E. Extracting latent steering vectors from pretrained language models. ACL, 2022

work page 2022
[68]

Team, T. G. Gemini: a family of highly capable multimodal models. 2023

work page 2023
[69]

Fine-tuning language models for factuality

Tian, K., Mitchell, E., Yao, H., Manning, C. D., and Finn, C. Fine-tuning language models for factuality. ICLR, 2024

work page 2024
[71]

URL https://arxiv.org/abs/2302.13971

work page internal anchor Pith review Pith/arXiv arXiv
[72]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023.

work page internal anchor Pith review Pith/arXiv arXiv 2023
[73]

Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M. R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., Almirantis, Y., Pavlopoulos, J., Baskiotis, N., Gallinari, P., Artiéres, T., Ngomo, A.-C. N., Heino, N., Gaussier, E., Barrio-Alvers, L., Schroeder, M., Androutsopoulos, I., and Paliouras, G. An ove...

work page 2015
[74]

Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting

Turpin, M., Michael, J., Perez, E., and Bowman, S. Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. NeurIPS, 2023

work page 2023
[75]

A stitch in time saves nine: Detecting and mitigating hallucinations of llms by actively validating low-confidence generation

Varshney, N., Yao, W., Zhang, H., Chen, J., and Yu, D. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by actively validating low-confidence generation. arXiv:2307.03987, 2023

work page arXiv 2023
[76]

Asking and answering questions to evaluate the factual consistency of summaries

Wang, A., Cho, K., and Lewis, M. Asking and answering questions to evaluate the factual consistency of summaries. ACL, 2020

work page 2020
[77]

Lawyer who used ChatGPT faces penalty for made up citations

Weiser, B. Lawyer who used ChatGPT faces penalty for made up citations. The New York Times, June 2023

work page 2023
[78]

Zhang, S., Pan, L., Zhao, J., and Wang, W. Y. Mitigating language model hallucination with interactive question-knowledge alignment. arXiv:2305.13669, 2023

work page arXiv 2023
[79]

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al. Siren’s song in the ai ocean: a survey on hallucination in large language models. arXiv:2309.01219, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[80]

Representation Engineering: A Top-Down Approach to AI Transparency

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to ai transparency. arXiv:2310.01405, 2023.

A Additional Results

Model Task Accuracies. We report the accuracies achieved by the models on the various datasets used in this work in Table 3...

work page internal anchor Pith review Pith/arXiv arXiv 2023