pith. machine review for the scientific record.

arxiv: 2406.15927 · v1 · pith:J3PD6HKJnew · submitted 2024-06-22 · 💻 cs.CL · cs.AI · cs.LG

Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

Pith reviewed 2026-05-18 00:47 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords semantic entropy · hallucination detection · uncertainty quantification · large language models · probing methods · hidden states

The pith

Semantic entropy probes detect hallucinations in LLMs using hidden states from a single generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces semantic entropy probes to approximate semantic entropy without multiple generations. This approach reduces the computational overhead of uncertainty quantification to nearly zero. It maintains strong performance in detecting hallucinations and generalizes better to new data distributions than prior methods. A sympathetic reader would care because it makes reliable hallucination detection practical for real-world LLM use.

Core claim

Semantic entropy probes (SEPs) are simple classifiers trained to estimate semantic entropy directly from the hidden states of one model generation, retaining high hallucination detection performance and better out-of-distribution generalization than accuracy-based probes.

What carries the argument

Semantic entropy probes, trained on hidden states to recover semantic entropy estimates from single generations.
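
As a rough illustration of what such a probe could look like, the sketch below fits a logistic-regression classifier to synthetic stand-in "hidden states". Several things here are assumptions rather than details from the paper or this page: the binarization of semantic entropy into high/low labels, the plain gradient-descent training loop, the 4-dimensional features, and the names `train_probe` and `probe_score` are all hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(features, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_score(w, b, x):
    """Predicted probability that semantic entropy is high for hidden state x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Synthetic stand-in data (an assumption): 4-dim "hidden states" whose first
# coordinate shifts with the binarized semantic-entropy label. Real inputs
# would be activations from one layer and token position of the LLM.
rng = random.Random(0)
labels = [i % 2 for i in range(200)]
features = [
    [rng.gauss(2.0 * y - 1.0, 0.5)] + [rng.gauss(0.0, 1.0) for _ in range(3)]
    for y in labels
]

w, b = train_probe(features[:150], labels[:150])
held_out_acc = sum(
    (probe_score(w, b, x) > 0.5) == bool(y)
    for x, y in zip(features[150:], labels[150:])
) / 50
```

At test time only `probe_score` runs, which is one dot product per generation; that is the sense in which the overhead after training is close to zero.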

If this is right

  • Hallucination detection no longer requires sampling 5-10 generations at test time.
  • Uncertainty quantification overhead drops to almost zero after probe training.
  • Probes generalize better to out-of-distribution data compared to direct accuracy prediction methods.
  • Model hidden states at certain layers and token positions encode semantic entropy information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If hidden states capture semantic entropy this way, similar probes might work for other forms of uncertainty.
  • This could enable always-on uncertainty monitoring in production LLM systems without extra compute.
  • Insights from ablations on layers and positions might guide more efficient model designs for interpretability.

Load-bearing premise

Hidden states from a single generation hold sufficient information about semantic entropy for a simple probe to recover it accurately across tasks and models.

What would settle it

Training a probe on single-generation hidden states and finding it does not predict semantic entropy values computed from multiple samples on new tasks.
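
One way to run that check is to score held-out examples with the probe and measure how well the scores rank high-entropy cases above low-entropy ones. The sketch below uses AUROC for this; the choice of metric, the function name `auroc`, and the toy numbers are assumptions for illustration, not values from the paper.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes to compute AUROC")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical held-out task: probe scores from single generations, and
# labels obtained by thresholding multi-sample semantic entropy.
probe_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
high_se = [1, 1, 1, 0, 0, 0]
held_out_auroc = auroc(probe_scores, high_se)
```

A value close to 0.5 on new tasks would mean the probe ranks no better than chance there, which is the outcome that would count against the load-bearing premise.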

Original abstract

We propose semantic entropy probes (SEPs), a cheap and reliable method for uncertainty quantification in Large Language Models (LLMs). Hallucinations, which are plausible-sounding but factually incorrect and arbitrary model generations, present a major challenge to the practical adoption of LLMs. Recent work by Farquhar et al. (2024) proposes semantic entropy (SE), which can detect hallucinations by estimating uncertainty in the space of semantic meaning for a set of model generations. However, the 5-to-10-fold increase in computation cost associated with SE computation hinders practical adoption. To address this, we propose SEPs, which directly approximate SE from the hidden states of a single generation. SEPs are simple to train and do not require sampling multiple model generations at test time, reducing the overhead of semantic uncertainty quantification to almost zero. We show that SEPs retain high performance for hallucination detection and generalize better to out-of-distribution data than previous probing methods that directly predict model accuracy. Our results across models and tasks suggest that model hidden states capture SE, and our ablation studies give further insights into the token positions and model layers for which this is the case.
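
To make the sampled quantity concrete, the sketch below computes a count-based semantic entropy from several generations that have already been grouped by meaning. The grouping step itself is not reproduced: the `cluster_ids` input is hypothetical and stands in for whatever equivalence judgment assigns generations to meaning clusters. Using cluster frequencies as probabilities is also a simplifying assumption.

```python
import math
from collections import Counter

def discrete_semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    cluster_ids: one label per sampled generation; generations sharing a
    label are assumed to have been judged semantically equivalent.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled generation")
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Ten samples that all share one meaning: no semantic uncertainty.
confident = discrete_semantic_entropy([0] * 10)
# Ten samples spread evenly over five meanings: high semantic uncertainty.
uncertain = discrete_semantic_entropy([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
```

Each call needs a full set of sampled generations, which is where the 5-to-10-fold cost mentioned in the abstract comes from; a probe replaces that sampling with a single forward pass.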

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Semantic Entropy Probes (SEPs), simple trained probes that approximate semantic entropy (SE) directly from the hidden states of a single LLM generation. This enables cheap hallucination detection without sampling multiple generations at test time, while claiming to retain high performance and exhibit better out-of-distribution generalization than prior probes that directly predict model accuracy. Results are reported across models and tasks, with ablations on token positions and layers.

Significance. If the empirical results hold under closer scrutiny, the work is significant for making semantic uncertainty quantification practical in deployment settings. Reducing the 5-10x compute overhead of SE to near-zero while preserving hallucination detection performance, and especially the reported OOD gains, would be a useful advance over both sampling-based SE and accuracy-probing baselines. The suggestion that hidden states implicitly capture semantic entropy is a falsifiable claim with potential for follow-on work.

major comments (3)
  1. [Results / Ablations] The central claim that SEPs recover semantic entropy (rather than surface-level uncertainty signals) from a single generation's hidden states is load-bearing but under-supported. The skeptic's concern is valid: nothing in the architecture guarantees that activations encode the breadth of the semantic distribution over meaning clusters instead of local features of the realized sequence. The manuscript would benefit from an explicit control (e.g., comparing probe performance against a baseline that only uses token probabilities or embedding norms) in the main results or ablations section.
  2. [Experimental Setup] Soundness is limited by missing experimental details. The abstract states retained performance and better OOD generalization, yet the text provides neither error bars across runs, full hyperparameter tables, nor complete ablation results on layer/token choices. Without these, it is difficult to assess whether the reported gains are robust or whether the probe is simply fitting to the SE-derived training labels.
  3. [Methods] The training procedure introduces a mild circularity risk that should be quantified. Because SEPs are supervised on labels derived from multi-sample semantic entropy calculations, it is unclear how much of the test-time performance reflects genuine approximation from hidden states versus leakage of the original SE signal through the training distribution. A leave-one-task-out or cross-model transfer experiment would clarify this.
minor comments (2)
  1. [Methods] Notation for the probe architecture (linear vs. shallow MLP) and the exact layer/token position chosen for probing should be stated more explicitly in the main text rather than deferred to the appendix.
  2. [Abstract] The abstract mentions 'across models and tasks' but does not list the specific models, datasets, or number of runs; adding a compact table or sentence would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity, rigor, and support for our claims.

  1. Referee: The central claim that SEPs recover semantic entropy (rather than surface-level uncertainty signals) from a single generation's hidden states is load-bearing but under-supported. The skeptic's concern is valid: nothing in the architecture guarantees that activations encode the breadth of the semantic distribution over meaning clusters instead of local features of the realized sequence. The manuscript would benefit from an explicit control (e.g., comparing probe performance against a baseline that only uses token probabilities or embedding norms) in the main results or ablations section.

    Authors: We agree that an explicit control experiment would provide stronger evidence that SEPs capture semantic entropy information rather than simpler surface-level signals. In the revised manuscript we have added a baseline probe trained directly on token probabilities and embedding norms extracted from the single generation. Results show that SEPs consistently outperform this baseline on hallucination detection across models and tasks, supporting the claim that hidden states encode semantic-level information beyond local sequence features. We have placed these comparisons in the main results and expanded the discussion of what this implies for the information captured by the probes. revision: yes

  2. Referee: Soundness is limited by missing experimental details. The abstract states retained performance and better OOD generalization, yet the text provides neither error bars across runs, full hyperparameter tables, nor complete ablation results on layer/token choices. Without these, it is difficult to assess whether the reported gains are robust or whether the probe is simply fitting to the SE-derived training labels.

    Authors: We acknowledge that additional experimental details are necessary for assessing robustness. The revised manuscript now includes error bars computed over five independent runs with different random seeds for all main results. A complete hyperparameter table (including probe architecture, learning rate, batch size, and regularization) has been added to the appendix. We have also expanded the ablation section to report performance for every combination of layer and token position, together with statistical comparisons. These additions allow readers to evaluate whether gains are stable and whether the probe is overfitting to the training labels. revision: yes

  3. Referee: The training procedure introduces a mild circularity risk that should be quantified. Because SEPs are supervised on labels derived from multi-sample semantic entropy calculations, it is unclear how much of the test-time performance reflects genuine approximation from hidden states versus leakage of the original SE signal through the training distribution. A leave-one-task-out or cross-model transfer experiment would clarify this.

    Authors: We agree that quantifying potential leakage from the multi-sample SE labels is valuable. In the revision we have added leave-one-task-out experiments in which the probe is trained on all tasks except one and evaluated on the held-out task. Performance remains competitive with the original in-distribution results, indicating that SEPs learn generalizable mappings from hidden states rather than merely memorizing task-specific SE patterns. We have also expanded the existing cross-model transfer results and included a brief analysis of how much performance drops when the training and test distributions differ more substantially. revision: yes
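
The evaluation protocol described in the responses (leave-one-task-out splits, with results summarized over several seeds) can be sketched as follows. The task names and the five per-seed values are placeholders invented for illustration, not reported results, and the helper names `leave_one_task_out` and `summarize` are hypothetical.

```python
import statistics

def leave_one_task_out(tasks):
    """Yield (training_tasks, held_out_task) pairs, holding out each task once."""
    for held_out in tasks:
        yield [t for t in tasks if t != held_out], held_out

def summarize(scores):
    """Mean and sample standard deviation of a metric across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

tasks = ["task_a", "task_b", "task_c"]  # placeholder task names
splits = list(leave_one_task_out(tasks))

# Placeholder metric values for five seeds on one held-out task; in a real
# run each value would come from training and evaluating a probe.
mean_score, std_score = summarize([0.70, 0.72, 0.71, 0.69, 0.73])
```

Reporting the mean together with the standard deviation for every held-out task makes it possible to tell a stable out-of-distribution gain from seed-to-seed noise.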

Circularity Check

0 steps flagged

No circularity: empirical probe training on independently computed targets


The paper defines semantic entropy via prior work (Farquhar et al. 2024) and trains a probe to approximate that quantity from single-generation hidden states. This is a standard supervised regression setup whose targets are computed separately from multiple samples; the probe weights are fitted on held-out data and evaluated on hallucination detection metrics. No equation reduces the target to the probe output by construction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review performed on abstract only; therefore free parameters, axioms, and invented entities are inferred at a high level from the described method.

free parameters (1)
  • probe weights
    The semantic entropy probe is trained on hidden-state features, implying fitted parameters whose values are not reported in the abstract.
axioms (1)
  • domain assumption: Hidden states of a single generation encode semantic entropy information
    Central premise required for the probe to recover semantic entropy without multiple samples.

pith-pipeline@v0.9.0 · 5753 in / 1085 out tokens · 29197 ms · 2026-05-18T00:47:49.686767+00:00 · methodology


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. Inducing Artificial Uncertainty in Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

  3. Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

    cs.AI 2026-05 unverdicted novelty 7.0

    Transformer hidden states encode facts as attractor basins; hallucinations occur from basin absence and conflicts from basin competition, detected cleanly by geometric margin rather than entropy.

  4. Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

    cs.AI 2026-05 unverdicted novelty 7.0

    Attractor basins in transformer hidden states unify conflict and hallucination as basin competition or absence, with geometric margin outperforming entropy for detection and a scaling law governing confident hallucina...

  5. Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs

    cs.SE 2025-09 unverdicted novelty 7.0

    Clotho ranks LLM test inputs by failure likelihood using pre-generation hidden states and GMMs, achieving 0.716 ROC-AUC after labeling 5.4% of inputs on average across eight tasks and three models, with transfer to pr...

  6. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.

  7. Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits

    cs.CL 2026-05 unverdicted novelty 6.0

    Probabilistic circuits detect LLM hallucinations as residual-stream anomalies with up to 99% AUROC and enable dynamic correction that raises truthfulness scores while cutting unnecessary output corruption.

  8. Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training

    cs.AI 2026-05 conditional novelty 6.0

    Average token log-probability provides a zero-shot confidence signal for small LLMs that matches supervised baselines in-distribution and outperforms them out-of-distribution, with a new retrieval-conditional variant ...

  9. To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

    cs.AI 2026-05 unverdicted novelty 6.0

    LLMs often misalign their self-perceived need for tools with true need and utility, but lightweight estimators trained on hidden states can improve tool-calling decisions and task performance across multiple models and tasks.

  10. The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive

    cs.CR 2026-04 unverdicted novelty 6.0

    LLM token rank-frequency distributions converge to a shared Mandelbrot distribution across models and domains, enabling a microsecond-scale statistical primitive for provenance verification and black-box anomaly triage.

  11. Process Supervision of Confidence Margin for Calibrated LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.

  12. Convergent Evolution: How Different Language Models Learn Similar Number Representations

    cs.CL 2026-04 unverdicted novelty 6.0

    Diverse language models converge on similar periodic number features with a two-tier hierarchy of Fourier sparsity and geometric separability, acquired via language co-occurrences or multi-token arithmetic.

  13. Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.

  14. Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation

    cs.LG 2026-04 unverdicted novelty 6.0

    Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.

  15. Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation

    cs.CL 2026-04 unverdicted novelty 6.0

    SHADE adaptively combines coverage and spectral signals to estimate semantic alphabet size from few LLM samples, yielding better performance than baselines in low-sample regimes for alphabet estimation and QA error detection.

  16. Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification

    cs.AI 2026-04 unverdicted novelty 6.0

    Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.

  17. Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs

    cs.CL 2026-03 unverdicted novelty 6.0

    Hallucination neurons in LLMs are domain-specific, with cross-domain classifiers dropping from AUROC 0.783 within-domain to 0.563 across domains.

  18. High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models

    cs.CV 2025-12 unverdicted novelty 6.0

    High-entropy tokens act as concentrated multimodal failure points in VLMs, enabling sparse Entropy-Guided Attacks that achieve 93-95% success and 30-38% harmful rates with cross-model transfer.

  19. Large Lemma Miners: Can LLMs do Induction Proofs for Hardware?

    cs.LO 2025-11 conditional novelty 6.0

    A neurosymbolic method using two LLM prompting frameworks generates provably correct inductive arguments for 84% of a set of mid-size open-source RTL hardware designs.

  20. GrACE: A Generative Approach to Better Confidence Elicitation and Efficient Test-Time Scaling in Large Language Models

    cs.CL 2025-09 unverdicted novelty 6.0

    GrACE is a fine-tuned generative method that uses similarity to a special token embedding for real-time calibrated confidence in LLMs and enables efficient confidence-based test-time scaling.

  21. Chain-of-Thought as a Lens: Evaluating Structured Reasoning Alignment between Human Preferences and Large Language Models

    cs.AI 2025-11 unverdicted novelty 5.0

    The Alignment Score quantifies semantic divergence between model-generated and human-preferred reasoning chains and correlates with accuracy, readability, and coherence.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 20 Pith papers · 13 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Abdin, M., Jacobs, S. A., Awan, A. A., Aneja, J., Awadallah, A., Awadalla, H., Bach, N., Bahree, A., Bakhtiari, A., Behl, H., Benhaim, A., Bilenko, M., Bjorck, J., Bubeck, S., Cai, M., Mendes, C. C. T., Chen, W., Chaudhary, V., Chopra, P., Giorno, A. D., de Rosa, G., Dixon, M., Eldan, R., Iter, D., Garg, A., Goswami, A., Gunasekar, S., Haider, E., Hao, J...

  2. [2]

    Agrawal, A., Mackey, L., and Kalai, A. T. Do language models know when they’re hallucinating references? In EACL, 2024

  3. [3]

    Understanding intermediate layers using linear classifier probes

    Alain, G. and Bengio, Y. Understanding intermediate layers using linear classifier probes. In ICLR, 2017

  4. [4]

    The internal state of an llm knows when it’s lying

    Azaria, A. and Mitchell, T. The internal state of an llm knows when it’s lying. In EMNLP, 2023

  5. [5]

    Linguistic calibration of language models

    Band, N., Li, X., Ma, T., and Hashimoto, T. Linguistic calibration of language models. arXiv:2404.00474, 2024

  6. [6]

    Probing classifiers: Promises, shortcomings, and advances

    Belinkov, Y. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 2021

  7. [7]

    Eliciting Latent Predictions from Transformers with the Tuned Lens

    Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Biderman, S., and Steinhardt, J. Eliciting latent predictions from transformers with the tuned lens. arXiv:2303.08112, 2023

  8. [8]

    Language models are few-shot learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. NeurIPS, 2020

  9. [9]

    Discovering latent knowledge in language models without supervision

    Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. In ICLR, 2023

  10. [10]

    Cao, M., Dong, Y., and Cheung, J. C. K. Hallucinated but factual! inspecting the factuality of hallucinations in abstractive summarization. In ACL, 2022

  11. [11]

    Quantifying uncertainty in answers from any language model and enhancing their trustworthiness

    Chen, J. and Mueller, J. Quantifying uncertainty in answers from any language model and enhancing their trustworthiness. arXiv:2308.16175, 2023

  12. [12]

    Dola: Decoding by contrasting layers improves factuality in large language models

    Chuang, Y.-S., Xie, Y., Luo, H., Kim, Y., Glass, J., and He, P. Dola: Decoding by contrasting layers improves factuality in large language models. In ICLR, 2024

  13. [13]

    Selectively answering ambiguous questions

    Cole, J. R., Zhang, M. J., Gillick, D., Eisenschlos, J. M., Dhingra, B., and Eisenstein, J. Selectively answering ambiguous questions. EMNLP, 2023

  14. [14]

    Towards question-answering as an automatic metric for evaluating the content quality of a summary

    Deutsch, D., Bedrax-Weiss, T., and Roth, D. Towards question-answering as an automatic metric for evaluating the content quality of a summary. TACL, 2021

  15. [15]

    Chain-of-Verification Reduces Hallucination in Large Language Models

    Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., and Weston, J. Chain-of-verification reduces hallucination in large language models. arXiv:2309.11495, 2023

  16. [16]

    Shifting attention to relevance: Towards the uncertainty estimation of large language models

    Duan, J., Cheng, H., Wang, S., Wang, C., Zavalny, A., Xu, R., Kailkhura, B., and Xu, K. Shifting attention to relevance: Towards the uncertainty estimation of large language models. arXiv:2307.01379, 2023

  17. [17]

    Feqa: A question answering evaluation framework for faithfulness assessment in abstractive summarization

    Durmus, E., He, H., and Diab, M. Feqa: A question answering evaluation framework for faithfulness assessment in abstractive summarization. ACL, 2020

  18. [18]

    Dziri, N., Madotto, A., Zaïane, O., and Bose, A. J. Neural path hunter: Reducing hallucination in dialogue systems via path grounding. In EMNLP, 2021

  19. [19]

    Halo: Estimation and reduction of hallucinations in open-source weak large language models

    Elaraby, M., Lu, M., Dunn, J., Zhang, X., Wang, Y., and Liu, S. Halo: Estimation and reduction of hallucinations in open-source weak large language models. arXiv:2308.11764, 2023

  20. [21]

    Detecting Hallucinations in Large Language Models Using Semantic Entropy

    Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y. Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature, 2024

  21. [22]

    Trapping llm hallucinations using tagged context prompts

    Feldman, P., Foulds, J. R., and Pan, S. Trapping llm hallucinations using tagged context prompts. arXiv:2306.06085, 2023

  22. [23]

    Controlled hallucinations: Learning to generate faithfully from noisy data

    Filippova, K. Controlled hallucinations: Learning to generate faithfully from noisy data. In EMNLP, 2020

  23. [24]

    Rarr: Researching and revising what language models say, using language models

    Gao, L., Dai, Z., Pasupat, P., Chen, A., Chaganty, A. T., Fan, Y., Zhao, V. Y., Lao, N., Lee, H., Juan, D.-C., et al. Rarr: Researching and revising what language models say, using language models. In ACL, 2022

  24. [25]

    Deberta: Decoding-enhanced bert with disentangled attention

    He, P., Liu, X., Gao, J., and Chen, W. Deberta: Decoding-enhanced bert with disentangled attention. In ICLR, 2021

  25. [26]

    Measuring and manipulating knowledge representations in language models

    Hernandez, E., Li, B. Z., and Andreas, J. Measuring and manipulating knowledge representations in language models. arXiv:2304.00740, 2023

  26. [27]

    Survey of hallucination in natural language generation

    Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., and Fung, P. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023

  27. [28]

    Mistral 7b

    Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b. arXiv, 2023

  28. [29]

    TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

    Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. ACL, 2017

  29. [30]

    Language Models (Mostly) Know What They Know

    Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Dodds, Z. H., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv:2207.05221, 2022

  30. [31]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

    Kuhn, L., Gal, Y., and Farquhar, S. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In ICLR, 2023

  31. [32]

    Natural questions: a benchmark for question answering research

    Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin, J., Lee, K., Toutanova, K. N., Jones, L., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., and Petrov, S. Natural questions: a benchmark for question answering research. TACL, 2019

  32. [33]

    Measuring Faithfulness in Chain-of-Thought Reasoning

    Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., Li, D., Durmus, E., Hubinger, E., Kernion, J., et al. Measuring faithfulness in chain-of-thought reasoning. arXiv:2307.13702, 2023

  33. [34]

    Factuality enhanced language models for open-ended text generation

    Lee, N., Ping, W., Xu, P., Patwary, M., Fung, P. N., Shoeybi, M., and Catanzaro, B. Factuality enhanced language models for open-ended text generation. NeurIPS, 2022

  34. [35]

    Inference-time intervention: Eliciting truthful answers from a language model

    Li, K., Patel, O., Viégas, F., Pfister, H., and Wattenberg, M. Inference-time intervention: Eliciting truthful answers from a language model. NeurIPS, 36, 2024

  35. [36]

    Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources

    Li, X., Zhao, R., Chia, Y. K., Ding, B., Joty, S., Poria, S., and Bing, L. Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources. In ICLR, 2023

  36. [37]

    Teaching models to express their uncertainty in words

    Lin, S., Hilton, J., and Evans, O. Teaching models to express their uncertainty in words. TMLR, 2023

  37. [38]

    Classification and regression trees

    Loh, W.-Y. Classification and regression trees. Wiley interdisciplinary reviews: data mining and knowledge discovery, 2011

  38. [39]

    Zero-resource hallucination prevention for large language models

    Luo, J., Xiao, C., and Ma, F. Zero-resource hallucination prevention for large language models. arXiv:2309.02654, 2023

  39. [40]

    Simple probes can catch sleeper agents

    MacDiarmid, M., Maxwell, T., Schiefer, N., Mu, J., Kaplan, J., Duvenaud, D., Bowman, S., Tamkin, A., Perez, E., Sharma, M., Denison, C., and Hubinger, E. Simple probes can catch sleeper agents, 2024. URL https://www.anthropic.com/news/probes-catch-sleeper-agents

  40. [41]

    Uncertainty estimation in autoregressive structured prediction

    Malinin, A. and Gales, M. Uncertainty estimation in autoregressive structured prediction. ICLR, 2021

  41. [42]

    Manakul, P., Liusie, A., and Gales, M. J. Mqag: Multiple-choice question answering and generation for assessing information consistency in summarization. IJCNLP-AACL, 2023

  42. [43]

    Manakul, P., Liusie, A., and Gales, M. J. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Conference on Empirical Methods in Natural Language Processing, 2023

  43. [44]

    The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

    Marks, S. and Tegmark, M. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. arXiv:2310.06824, 2023

  44. [45]

    On faithfulness and factuality in abstractive summarization

    Maynez, J., Narayan, S., Bohnet, B., and McDonald, R. On faithfulness and factuality in abstractive summarization. In ACL, 2020

  45. [46]

    Introducing meta llama 3: The most capable openly available llm to date

    Meta. Introducing meta llama 3: The most capable openly available llm to date, 2024. URL https://ai.meta.com/blog/meta-llama-3/. [Online; accessed June 16 2024]

  46. [47]

    Reducing conversational agents’ overconfidence through linguistic calibration

    Mielke, S. J., Szlam, A., Boureau, Y.-L., and Dinan, E. Reducing conversational agents’ overconfidence through linguistic calibration. TACL, 2022

  47. [48]

    Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

    Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W.-t., Koh, P. W., Iyyer, M., Zettlemoyer, L., and Hajishirzi, H. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. EMNLP, 2023

  48. [49]

    Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation

    Mündler, N., He, J., Jenko, S., and Vechev, M. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. arXiv:2305.15852, 2023

  49. [50]

    Correcting length bias in neural machine translation

    Murray, K. and Chiang, D. Correcting length bias in neural machine translation. WMT, 2018

  50. [51]

    Nan, F., Santos, C. N. d., Zhu, H., Ng, P., McKeown, K., Nallapati, R., Zhang, D., Wang, Z., Arnold, A. O., and Xiang, B. Improving factual consistency of abstractive summarization via question answering. ACL-IJCNLP, 2021

  51. [52]

    Fact finding: Attempting to reverse-engineer factual recall on the neuron level

    Nanda, N., Rajamanoharan, S., Kramar, J., and Shah, R. Fact finding: Attempting to reverse-engineer factual recall on the neuron level, Dec 2023. URL https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall

  52. [53]

    Trustworthy journalism through AI

    Opdahl, A. L., Tessem, B., Dang-Nguyen, D.-T., Motta, E., Setty, V., Throndsen, E., Tverberg, A., and Trattner, C. Trustworthy journalism through AI. Data Knowl. Eng., 2023

  53. [54]

    GPT-4 technical report

    OpenAI. GPT-4 technical report, 2023

  54. [55]

    Future lens: Anticipating subsequent tokens from a single hidden state

    Pal, K., Sun, J., Yuan, A., Wallace, B., and Bau, D. Future lens: Anticipating subsequent tokens from a single hidden state. In CoNLL, 2023

  55. [56]

    Scikit-learn: Machine learning in Python

    Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in Python. JMLR, 12, 2011

  56. [57]

    Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback

    Peng, B., Galley, M., He, P., Cheng, H., Xie, Y., Hu, Y., Huang, Q., Liden, L., Yu, Z., Chen, W., et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv:2302.12813, 2023

  57. [58]

    Language models as knowledge bases?

    Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., and Riedel, S. Language models as knowledge bases? EMNLP, 2019

  58. [59]

    Know what you don’t know: Unanswerable questions for SQuAD

    Rajpurkar, P., Jia, R., and Liang, P. Know what you don’t know: Unanswerable questions for SQuAD. ACL, 2018

  59. [60]

    A Survey of Hallucination in Large Foundation Models

    Rawte, V., Sheth, A., and Das, A. A survey of hallucination in large foundation models. arXiv:2309.05922, 2023

  60. [61]

    Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. M. Steering llama 2 via contrastive activation addition. arXiv:2312.06681, 2023

  61. [62]

    How much knowledge can you pack into the parameters of a language model?

    Roberts, A., Raffel, C., and Shazeer, N. How much knowledge can you pack into the parameters of a language model? EMNLP, 2020

  62. [63]

    ChatGPT and other large language models are double-edged swords

    Shen, Y., Heacock, L., Elias, J., Hentel, K. D., Reig, B., Shih, G., and Moy, L. ChatGPT and other large language models are double-edged swords. Radiology, 2023

  63. [64]

    Shi, W., Han, X., Lewis, M., Tsvetkov, Y., Zettlemoyer, L., and Yih, S. W.-t. Trusting your evidence: Hallucinate less with context-aware decoding. NAACL, 2023

  64. [65]

    Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., Payne, P., Seneviratne, M., Gamble, P., Kelly, C., Scharli, N., Chowdhery, A., Mansfield, P., y Arcas, B. A., Webster, D., Corrado, G. S., Matias, Y., Chou, K., Gottweis, J., Tomasev, N., Liu, Y., Rajkomar, A., Barral, J., Semturs, ...

  65. [66]

    Read before generate! faithful long form question answering with machine reading

    Su, D., Li, X., Zhang, J., Shang, L., Jiang, X., Liu, Q., and Fung, P. Read before generate! faithful long form question answering with machine reading. ACL, 2022

  66. [67]

    Subramani, N., Suresh, N., and Peters, M. E. Extracting latent steering vectors from pretrained language models. ACL, 2022

  67. [68]

    Team, T. G. Gemini: a family of highly capable multimodal models. 2023

  68. [69]

    Fine-tuning language models for factuality

    Tian, K., Mitchell, E., Yao, H., Manning, C. D., and Finn, C. Fine-tuning language models for factuality. ICLR, 2024

  69. [71]

    URL https://arxiv.org/abs/2302.13971

  70. [72]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023

  71. [73]

    Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M. R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., Almirantis, Y., Pavlopoulos, J., Baskiotis, N., Gallinari, P., Artiéres, T., Ngomo, A.-C. N., Heino, N., Gaussier, E., Barrio-Alvers, L., Schroeder, M., Androutsopoulos, I., and Paliouras, G. An ove...

  72. [74]

    Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting

    Turpin, M., Michael, J., Perez, E., and Bowman, S. Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. NeurIPS, 2023

  73. [75]

    A stitch in time saves nine: Detecting and mitigating hallucinations of llms by actively validating low-confidence generation

    Varshney, N., Yao, W., Zhang, H., Chen, J., and Yu, D. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by actively validating low-confidence generation. arXiv:2307.03987, 2023

  74. [76]

    Asking and answering questions to evaluate the factual consistency of summaries

    Wang, A., Cho, K., and Lewis, M. Asking and answering questions to evaluate the factual consistency of summaries. ACL, 2020

  75. [77]

    Lawyer who used ChatGPT faces penalty for made up citations

    Weiser, B. Lawyer who used ChatGPT faces penalty for made up citations. The New York Times, June 2023

  76. [78]

    Zhang, S., Pan, L., Zhao, J., and Wang, W. Y. Mitigating language model hallucination with interactive question-knowledge alignment. arXiv:2305.13669, 2023

  77. [79]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al. Siren’s song in the ai ocean: a survey on hallucination in large language models. arXiv:2309.01219, 2023

  78. [80]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to ai transparency. arXiv:2310.01405, 2023

A Additional Results

Model Task Accuracies. We report the accuracies achieved by the models on the various datasets used in this work in Table 3...