Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
Pith reviewed 2026-05-18 00:47 UTC · model grok-4.3
The pith
Semantic entropy probes detect hallucinations in LLMs using hidden states from a single generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Semantic entropy probes (SEPs) are simple classifiers trained to estimate semantic entropy directly from the hidden states of one model generation, retaining high hallucination detection performance and better out-of-distribution generalization than accuracy-based probes.
What carries the argument
Semantic entropy probes, trained on hidden states to recover semantic entropy estimates from single generations.
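As a rough illustration of what such a probe could look like, the sketch below fits a logistic-regression classifier that predicts binarized semantic entropy from hidden-state vectors. Everything concrete in it is an assumption made for illustration: the hidden states are synthetic Gaussian vectors, the semantic entropy values are simulated as a noisy linear function of them, and the median threshold and the names `train_probe` and `predict_proba` are hypothetical. This page does not confirm the paper's exact probe architecture or binarization rule.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(X, y, lr=0.5, epochs=200):
    """Fit logistic regression by full-batch gradient descent; returns (weights, bias)."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            for j in range(d):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict_proba(w, b, x):
    """Probability that the input has high semantic entropy."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Synthetic stand-ins (assumption): real inputs would be hidden states taken
# from one layer and token position, and real targets would be semantic
# entropy values computed from several sampled generations.
rng = random.Random(0)
DIM = 8
direction = [rng.gauss(0, 1) for _ in range(DIM)]

def make_example():
    h = [rng.gauss(0, 1) for _ in range(DIM)]
    se = sum(a * c for a, c in zip(h, direction)) + rng.gauss(0, 0.3)
    return h, se

data = [make_example() for _ in range(400)]
threshold = sorted(se for _, se in data)[len(data) // 2]  # median split (assumption)
X = [h for h, _ in data]
y = [1.0 if se > threshold else 0.0 for _, se in data]

w, b = train_probe(X[:300], y[:300])
accuracy = sum(
    (predict_proba(w, b, x) > 0.5) == (t == 1.0) for x, t in zip(X[300:], y[300:])
) / 100
```

At test time only `predict_proba` is needed, which is a single dot product per query; that is the sense in which the overhead after training is close to zero.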
If this is right
- Hallucination detection no longer requires sampling 5-10 generations at test time.
- Uncertainty quantification overhead drops to almost zero after probe training.
- Probes generalize better to out-of-distribution data compared to direct accuracy prediction methods.
- Model hidden states at certain layers and token positions encode semantic entropy information.
Where Pith is reading between the lines
- If hidden states capture semantic entropy this way, similar probes might work for other forms of uncertainty.
- This could enable always-on uncertainty monitoring in production LLM systems without extra compute.
- Insights from ablations on layers and positions might guide more efficient model designs for interpretability.
Load-bearing premise
Hidden states from a single generation hold sufficient information about semantic entropy for a simple probe to recover it accurately across tasks and models.
What would settle it
Training a probe on single-generation hidden states and finding it does not predict semantic entropy values computed from multiple samples on new tasks.
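One way to score that test, sketched under assumptions: treat the probe output as a ranking score and compute AUROC against binarized semantic entropy labels from a task the probe never saw. The function name `auroc` and the toy numbers are illustrative and are not results from the paper.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy held-out task: label 1 means high semantic entropy (computed from
# multiple samples), scores are hypothetical probe outputs.
probe_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
high_se = [1, 1, 1, 0, 0, 0]
held_out_auroc = auroc(probe_scores, high_se)

# A probe that carries no information gives 0.5.
chance_auroc = auroc([0.5] * 6, high_se)
```

A held-out AUROC that stays near 0.5 across new tasks would count against the premise; a value well above 0.5 would support it.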
Original abstract
We propose semantic entropy probes (SEPs), a cheap and reliable method for uncertainty quantification in Large Language Models (LLMs). Hallucinations, which are plausible-sounding but factually incorrect and arbitrary model generations, present a major challenge to the practical adoption of LLMs. Recent work by Farquhar et al. (2024) proposes semantic entropy (SE), which can detect hallucinations by estimating uncertainty in the space of semantic meaning for a set of model generations. However, the 5-to-10-fold increase in computation cost associated with SE computation hinders practical adoption. To address this, we propose SEPs, which directly approximate SE from the hidden states of a single generation. SEPs are simple to train and do not require sampling multiple model generations at test time, reducing the overhead of semantic uncertainty quantification to almost zero. We show that SEPs retain high performance for hallucination detection and generalize better to out-of-distribution data than previous probing methods that directly predict model accuracy. Our results across models and tasks suggest that model hidden states capture SE, and our ablation studies give further insights into the token positions and model layers for which this is the case.
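To make the sampling-based quantity concrete, here is a minimal sketch of a discrete semantic entropy estimate: group sampled answers by meaning, then take the entropy of the cluster frequencies. The equivalence check `toy_same_meaning` is a deliberately crude placeholder (string normalization) and the function names are hypothetical; a real pipeline would typically decide equivalence with an entailment model.

```python
import math
from collections import Counter

def cluster_ids(answers, same_meaning):
    """Greedily assign each answer to the first cluster whose representative matches it."""
    representatives, ids = [], []
    for answer in answers:
        for i, rep in enumerate(representatives):
            if same_meaning(answer, rep):
                ids.append(i)
                break
        else:
            representatives.append(answer)
            ids.append(len(representatives) - 1)
    return ids

def discrete_semantic_entropy(ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    n = len(ids)
    return sum((c / n) * math.log(n / c) for c in Counter(ids).values())

def toy_same_meaning(a, b):
    # Placeholder equivalence (assumption): case, whitespace and a trailing period are ignored.
    norm = lambda s: s.strip().lower().rstrip(".")
    return norm(a) == norm(b)

consistent = ["Paris", "paris.", "Paris", " PARIS", "Paris."]
scattered = ["Paris", "Lyon", "Marseille", "Nice", "Paris"]

se_low = discrete_semantic_entropy(cluster_ids(consistent, toy_same_meaning))
se_high = discrete_semantic_entropy(cluster_ids(scattered, toy_same_meaning))
```

Five differently worded samples that all mean the same thing give zero entropy, while samples scattered over four meanings give a high value. Producing those samples is the 5-to-10-fold cost that the probes are meant to avoid.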
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Semantic Entropy Probes (SEPs), simple trained probes that approximate semantic entropy (SE) directly from the hidden states of a single LLM generation. This enables cheap hallucination detection without sampling multiple generations at test time, while claiming to retain high performance and exhibit better out-of-distribution generalization than prior probes that directly predict model accuracy. Results are reported across models and tasks, with ablations on token positions and layers.
Significance. If the empirical results hold under closer scrutiny, the work is significant for making semantic uncertainty quantification practical in deployment settings. Reducing the 5-10x compute overhead of SE to near-zero while preserving hallucination detection performance, and especially the reported OOD gains, would be a useful advance over both sampling-based SE and accuracy-probing baselines. The suggestion that hidden states implicitly capture semantic entropy is a falsifiable claim with potential for follow-on work.
major comments (3)
- [Results / Ablations] The central claim that SEPs recover semantic entropy (rather than surface-level uncertainty signals) from a single generation's hidden states is load-bearing but under-supported. The skeptical concern is well founded: nothing in the architecture guarantees that activations encode the breadth of the semantic distribution over meaning clusters instead of local features of the realized sequence. The manuscript would benefit from an explicit control (e.g., comparing probe performance against a baseline that only uses token probabilities or embedding norms) in the main results or ablations section.
- [Experimental Setup] Soundness is limited by missing experimental details. The abstract states retained performance and better OOD generalization, yet the text provides no error bars across runs, no full hyperparameter tables, and no complete ablation results on layer/token choices. Without these, it is difficult to assess whether the reported gains are robust or whether the probe is simply fitting to the SE-derived training labels.
- [Methods] The training procedure introduces a mild circularity risk that should be quantified. Because SEPs are supervised on labels derived from multi-sample semantic entropy calculations, it is unclear how much of the test-time performance reflects genuine approximation from hidden states versus leakage of the original SE signal through the training distribution. A leave-one-task-out or cross-model transfer experiment would clarify this.
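The control suggested in the first major comment could start from something as simple as the sketch below: a length-normalized negative log-likelihood computed from the token probabilities of the single generation, used directly as an uncertainty score. The function name `mean_token_nll` and the example probabilities are hypothetical; this is one common surface-level baseline, not necessarily the one the authors used.

```python
import math

def mean_token_nll(token_logprobs):
    """Length-normalized negative log-likelihood of one generation (higher = less confident)."""
    if not token_logprobs:
        raise ValueError("empty generation")
    return -sum(token_logprobs) / len(token_logprobs)

# Hypothetical per-token probabilities for two generations.
confident = [math.log(0.9), math.log(0.95), math.log(0.85)]
hesitant = [math.log(0.4), math.log(0.3), math.log(0.5)]

baseline_scores = [mean_token_nll(confident), mean_token_nll(hesitant)]
```

If a hidden-state probe only matches this score on hallucination detection, the simpler reading (surface-level confidence) would be hard to rule out; a clear margin over it is what the comment asks to see.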
minor comments (2)
- [Methods] Notation for the probe architecture (linear vs. shallow MLP) and the exact layer/token position chosen for probing should be stated more explicitly in the main text rather than deferred to the appendix.
- [Abstract] The abstract mentions 'across models and tasks' but does not list the specific models, datasets, or number of runs; adding a compact table or sentence would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity, rigor, and support for our claims.
-
Referee: The central claim that SEPs recover semantic entropy (rather than surface-level uncertainty signals) from a single generation's hidden states is load-bearing but under-supported. The skeptical concern is well founded: nothing in the architecture guarantees that activations encode the breadth of the semantic distribution over meaning clusters instead of local features of the realized sequence. The manuscript would benefit from an explicit control (e.g., comparing probe performance against a baseline that only uses token probabilities or embedding norms) in the main results or ablations section.
Authors: We agree that an explicit control experiment would provide stronger evidence that SEPs capture semantic entropy information rather than simpler surface-level signals. In the revised manuscript we have added a baseline probe trained directly on token probabilities and embedding norms extracted from the single generation. Results show that SEPs consistently outperform this baseline on hallucination detection across models and tasks, supporting the claim that hidden states encode semantic-level information beyond local sequence features. We have placed these comparisons in the main results and expanded the discussion of what this implies for the information captured by the probes. revision: yes
-
Referee: Soundness is limited by missing experimental details. The abstract states retained performance and better OOD generalization, yet the text provides no error bars across runs, no full hyperparameter tables, and no complete ablation results on layer/token choices. Without these, it is difficult to assess whether the reported gains are robust or whether the probe is simply fitting to the SE-derived training labels.
Authors: We acknowledge that additional experimental details are necessary for assessing robustness. The revised manuscript now includes error bars computed over five independent runs with different random seeds for all main results. A complete hyperparameter table (including probe architecture, learning rate, batch size, and regularization) has been added to the appendix. We have also expanded the ablation section to report performance for every combination of layer and token position, together with statistical comparisons. These additions allow readers to evaluate whether gains are stable and whether the probe is overfitting to the training labels. revision: yes
-
Referee: The training procedure introduces a mild circularity risk that should be quantified. Because SEPs are supervised on labels derived from multi-sample semantic entropy calculations, it is unclear how much of the test-time performance reflects genuine approximation from hidden states versus leakage of the original SE signal through the training distribution. A leave-one-task-out or cross-model transfer experiment would clarify this.
Authors: We agree that quantifying potential leakage from the multi-sample SE labels is valuable. In the revision we have added leave-one-task-out experiments in which the probe is trained on all tasks except one and evaluated on the held-out task. Performance remains competitive with the original in-distribution results, indicating that SEPs learn generalizable mappings from hidden states rather than merely memorizing task-specific SE patterns. We have also expanded the existing cross-model transfer results and included a brief analysis of how much performance drops when the training and test distributions differ more substantially. revision: yes
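The leave-one-task-out protocol mentioned in the last response can be sketched as a simple split generator. The task names and example tuples below are placeholders, and `leave_one_task_out` is a hypothetical helper rather than code from the paper.

```python
def leave_one_task_out(examples_by_task):
    """Yield (held_out_task, train_examples, test_examples) once per task."""
    for held_out in sorted(examples_by_task):
        train = [
            example
            for task, examples in sorted(examples_by_task.items())
            if task != held_out
            for example in examples
        ]
        test = list(examples_by_task[held_out])
        yield held_out, train, test

# Placeholder data: each example is (hidden_state_id, high_se_label).
toy = {
    "task_a": [("h_a1", 1), ("h_a2", 0)],
    "task_b": [("h_b1", 1)],
    "task_c": [("h_c1", 0), ("h_c2", 0), ("h_c3", 1)],
}

splits = list(leave_one_task_out(toy))
```

Each split trains the probe on every task except one and evaluates on the excluded task, so no example from the evaluation task can influence the probe weights.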
Circularity Check
No circularity: empirical probe training on independently computed targets
The paper defines semantic entropy via prior work (Farquhar et al. 2024) and trains a probe to approximate that quantity from single-generation hidden states. This is a standard supervised regression setup whose targets are computed separately from multiple samples; the probe weights are fitted on training data and evaluated on held-out data with hallucination detection metrics. No equation reduces the target to the probe output by construction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.
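One way to write the setup down, assuming a logistic probe and a thresholded target. The symbols are illustrative and not the paper's notation: $s_1, \dots, s_N$ are sampled generations for input $x$, $C(x)$ is their set of meaning clusters, $h(x)$ is the hidden state of a single generation, and $\gamma$ is a binarization threshold.

```latex
\begin{aligned}
\hat{p}(c \mid x) &= \frac{\lvert \{\, i : s_i \in c \,\} \rvert}{N}, \\
\hat{H}_{\mathrm{SE}}(x) &= -\sum_{c \in C(x)} \hat{p}(c \mid x)\, \log \hat{p}(c \mid x), \\
y(x) &= \mathbb{1}\!\left[ \hat{H}_{\mathrm{SE}}(x) > \gamma \right], \\
\hat{y}(x) &= \sigma\!\left( w^{\top} h(x) + b \right).
\end{aligned}
```

The target $y(x)$ depends only on the $N$ samples, while the prediction $\hat{y}(x)$ depends only on $h(x)$ and the fitted parameters $(w, b)$, so neither is defined in terms of the other.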
Axiom & Free-Parameter Ledger
free parameters (1)
- probe weights
axioms (1)
- domain assumption: Hidden states of a single generation encode semantic entropy information
Forward citations
Cited by 21 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
Inducing Artificial Uncertainty in Language Models
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
-
Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
Transformer hidden states encode facts as attractor basins; hallucinations occur from basin absence and conflicts from basin competition, detected cleanly by geometric margin rather than entropy.
-
Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
Attractor basins in transformer hidden states unify conflict and hallucination as basin competition or absence, with geometric margin outperforming entropy for detection and a scaling law governing confident hallucina...
-
Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs
Clotho ranks LLM test inputs by failure likelihood using pre-generation hidden states and GMMs, achieving 0.716 ROC-AUC after labeling 5.4% of inputs on average across eight tasks and three models, with transfer to pr...
-
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.
-
Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits
Probabilistic circuits detect LLM hallucinations as residual-stream anomalies with up to 99% AUROC and enable dynamic correction that raises truthfulness scores while cutting unnecessary output corruption.
-
Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training
Average token log-probability provides a zero-shot confidence signal for small LLMs that matches supervised baselines in-distribution and outperforms them out-of-distribution, with a new retrieval-conditional variant ...
-
To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
LLMs often misalign their self-perceived need for tools with true need and utility, but lightweight estimators trained on hidden states can improve tool-calling decisions and task performance across multiple models and tasks.
-
The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive
LLM token rank-frequency distributions converge to a shared Mandelbrot distribution across models and domains, enabling a microsecond-scale statistical primitive for provenance verification and black-box anomaly triage.
-
Process Supervision of Confidence Margin for Calibrated LLM Reasoning
RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
-
Convergent Evolution: How Different Language Models Learn Similar Number Representations
Diverse language models converge on similar periodic number features with a two-tier hierarchy of Fourier sparsity and geometric separability, acquired via language co-occurrences or multi-token arithmetic.
-
Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.
-
Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation
Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.
-
Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation
SHADE adaptively combines coverage and spectral signals to estimate semantic alphabet size from few LLM samples, yielding better performance than baselines in low-sample regimes for alphabet estimation and QA error detection.
-
Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.
-
Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs
Hallucination neurons in LLMs are domain-specific, with cross-domain classifiers dropping from AUROC 0.783 within-domain to 0.563 across domains.
-
High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models
High-entropy tokens act as concentrated multimodal failure points in VLMs, enabling sparse Entropy-Guided Attacks that achieve 93-95% success and 30-38% harmful rates with cross-model transfer.
-
Large Lemma Miners: Can LLMs do Induction Proofs for Hardware?
A neurosymbolic method using two LLM prompting frameworks generates provably correct inductive arguments for 84% of a set of mid-size open-source RTL hardware designs.
-
GrACE: A Generative Approach to Better Confidence Elicitation and Efficient Test-Time Scaling in Large Language Models
GrACE is a fine-tuned generative method that uses similarity to a special token embedding for real-time calibrated confidence in LLMs and enables efficient confidence-based test-time scaling.
-
Chain-of-Thought as a Lens: Evaluating Structured Reasoning Alignment between Human Preferences and Large Language Models
The Alignment Score quantifies semantic divergence between model-generated and human-preferred reasoning chains and correlates with accuracy, readability, and coherence.
Reference graph
Works this paper leans on
- [1] Abdin, M., Jacobs, S. A., Awan, A. A., Aneja, J., Awadallah, A., Awadalla, H., Bach, N., Bahree, A., Bakhtiari, A., Behl, H., Benhaim, A., Bilenko, M., Bjorck, J., Bubeck, S., Cai, M., Mendes, C. C. T., Chen, W., Chaudhary, V., Chopra, P., Giorno, A. D., de Rosa, G., Dixon, M., Eldan, R., Iter, D., Garg, A., Goswami, A., Gunasekar, S., Haider, E., Hao, J... Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv, 2024.
- [2] Agrawal, A., Mackey, L., and Kalai, A. T. Do language models know when they’re hallucinating references? In EACL, 2024.
- [3] Alain, G. and Bengio, Y. Understanding intermediate layers using linear classifier probes. In ICLR, 2017.
- [4] Azaria, A. and Mitchell, T. The internal state of an llm knows when it’s lying. In EMNLP, 2023.
- [5] Band, N., Li, X., Ma, T., and Hashimoto, T. Linguistic calibration of language models. arXiv:2404.00474, 2024.
- [6] Belinkov, Y. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 2021.
- [7] Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Biderman, S., and Steinhardt, J. Eliciting latent predictions from transformers with the tuned lens. arXiv:2303.08112, 2023.
- [8] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. NeurIPS, 2020.
- [9] Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. In ICLR, 2023.
- [10] Cao, M., Dong, Y., and Cheung, J. C. K. Hallucinated but factual! inspecting the factuality of hallucinations in abstractive summarization. In ACL, 2022.
- [11] Chen, J. and Mueller, J. Quantifying uncertainty in answers from any language model and enhancing their trustworthiness. arXiv:2308.16175, 2023.
- [12] Chuang, Y.-S., Xie, Y., Luo, H., Kim, Y., Glass, J., and He, P. Dola: Decoding by contrasting layers improves factuality in large language models. In ICLR, 2024.
- [13] Cole, J. R., Zhang, M. J., Gillick, D., Eisenschlos, J. M., Dhingra, B., and Eisenstein, J. Selectively answering ambiguous questions. EMNLP, 2023.
- [14] Deutsch, D., Bedrax-Weiss, T., and Roth, D. Towards question-answering as an automatic metric for evaluating the content quality of a summary. TACL, 2021.
- [15] Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., and Weston, J. Chain-of-verification reduces hallucination in large language models. arXiv:2309.11495, 2023.
- [16] Duan, J., Cheng, H., Wang, S., Wang, C., Zavalny, A., Xu, R., Kailkhura, B., and Xu, K. Shifting attention to relevance: Towards the uncertainty estimation of large language models. arXiv:2307.01379, 2023.
- [17] Durmus, E., He, H., and Diab, M. Feqa: A question answering evaluation framework for faithfulness assessment in abstractive summarization. ACL, 2020.
- [18] Dziri, N., Madotto, A., Zaïane, O., and Bose, A. J. Neural path hunter: Reducing hallucination in dialogue systems via path grounding. In EMNLP, 2021.
- [19] Elaraby, M., Lu, M., Dunn, J., Zhang, X., Wang, Y., and Liu, S. Halo: Estimation and reduction of hallucinations in open-source weak large language models. arXiv:2308.11764, 2023.
- [21] Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y. Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature, 2024.
- [22] Feldman, P., Foulds, J. R., and Pan, S. Trapping llm hallucinations using tagged context prompts. arXiv:2306.06085, 2023.
- [23] Filippova, K. Controlled hallucinations: Learning to generate faithfully from noisy data. In EMNLP, 2020.
- [24] Gao, L., Dai, Z., Pasupat, P., Chen, A., Chaganty, A. T., Fan, Y., Zhao, V. Y., Lao, N., Lee, H., Juan, D.-C., et al. Rarr: Researching and revising what language models say, using language models. In ACL, 2022.
- [25] He, P., Liu, X., Gao, J., and Chen, W. Deberta: Decoding-enhanced bert with disentangled attention. In ICLR, 2021.
- [26] Hernandez, E., Li, B. Z., and Andreas, J. Measuring and manipulating knowledge representations in language models. arXiv:2304.00740, 2023.
- [27] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., and Fung, P. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
- [28] Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b. arXiv, 2023.
- [29] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. ACL, 2017.
- [30] Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Dodds, Z. H., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv:2207.05221, 2022.
- [31] Kuhn, L., Gal, Y., and Farquhar, S. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In ICLR, 2023.
- [32] Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin, J., Lee, K., Toutanova, K. N., Jones, L., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., and Petrov, S. Natural questions: a benchmark for question answering research. TACL, 2019.
- [33] Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., Li, D., Durmus, E., Hubinger, E., Kernion, J., et al. Measuring faithfulness in chain-of-thought reasoning. arXiv:2307.13702, 2023.
- [34] Lee, N., Ping, W., Xu, P., Patwary, M., Fung, P. N., Shoeybi, M., and Catanzaro, B. Factuality enhanced language models for open-ended text generation. NeurIPS, 2022.
- [35] Li, K., Patel, O., Viégas, F., Pfister, H., and Wattenberg, M. Inference-time intervention: Eliciting truthful answers from a language model. NeurIPS, 36, 2024.
- [36] Li, X., Zhao, R., Chia, Y. K., Ding, B., Joty, S., Poria, S., and Bing, L. Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources. In ICLR, 2023.
- [37] Lin, S., Hilton, J., and Evans, O. Teaching models to express their uncertainty in words. TMLR, 2023.
- [38] Loh, W.-Y. Classification and regression trees. Wiley interdisciplinary reviews: data mining and knowledge discovery, 2011.
- [39] Luo, J., Xiao, C., and Ma, F. Zero-resource hallucination prevention for large language models. arXiv:2309.02654, 2023.
- [40] MacDiarmid, M., Maxwell, T., Schiefer, N., Mu, J., Kaplan, J., Duvenaud, D., Bowman, S., Tamkin, A., Perez, E., Sharma, M., Denison, C., and Hubinger, E. Simple probes can catch sleeper agents, 2024. URL https://www.anthropic.com/news/probes-catch-sleeper-agents
- [41] Malinin, A. and Gales, M. Uncertainty estimation in autoregressive structured prediction. ICLR, 2021.
- [42] Manakul, P., Liusie, A., and Gales, M. J. Mqag: Multiple-choice question answering and generation for assessing information consistency in summarization. IJCNLP-AACL, 2023.
- [43] Manakul, P., Liusie, A., and Gales, M. J. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Conference on Empirical Methods in Natural Language Processing, 2023.
- [44] Marks, S. and Tegmark, M. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. arXiv:2310.06824, 2023.
- [45] Maynez, J., Narayan, S., Bohnet, B., and McDonald, R. On faithfulness and factuality in abstractive summarization. In ACL, 2020.
- [46] Meta. Introducing meta llama 3: The most capable openly available llm to date, 2024. URL https://ai.meta.com/blog/meta-llama-3/. [Online; accessed June 16 2024]
- [47] Mielke, S. J., Szlam, A., Boureau, Y.-L., and Dinan, E. Reducing conversational agents’ overconfidence through linguistic calibration. TACL, 2022.
- [48] Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W.-t., Koh, P. W., Iyyer, M., Zettlemoyer, L., and Hajishirzi, H. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. EMNLP, 2023.
- [49] Mündler, N., He, J., Jenko, S., and Vechev, M. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. arXiv:2305.15852, 2023.
- [50] Murray, K. and Chiang, D. Correcting length bias in neural machine translation. WMT, 2018.
- [51] Nan, F., Santos, C. N. d., Zhu, H., Ng, P., McKeown, K., Nallapati, R., Zhang, D., Wang, Z., Arnold, A. O., and Xiang, B. Improving factual consistency of abstractive summarization via question answering. ACL-IJCNLP, 2021.
- [52] Nanda, N., Rajamanoharan, S., Kramar, J., and Shah, R. Fact finding: Attempting to reverse-engineer factual recall on the neuron level, Dec 2023. URL https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
- [53] Opdahl, A. L., Tessem, B., Dang-Nguyen, D.-T., Motta, E., Setty, V., Throndsen, E., Tverberg, A., and Trattner, C. Trustworthy journalism through AI. Data Knowl. Eng., 2023.
- [54]
- [55] Pal, K., Sun, J., Yuan, A., Wallace, B., and Bau, D. Future lens: Anticipating subsequent tokens from a single hidden state. In CoNLL, 2023.
- [56] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in Python. JMLR, 12, 2011.
- [57] Peng, B., Galley, M., He, P., Cheng, H., Xie, Y., Hu, Y., Huang, Q., Liden, L., Yu, Z., Chen, W., et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv:2302.12813, 2023.
- [58] Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., and Riedel, S. Language models as knowledge bases? EMNLP, 2019.
- [59] Rajpurkar, P., Jia, R., and Liang, P. Know what you don’t know: Unanswerable questions for squad. ACL, 2018.
- [60] Rawte, V., Sheth, A., and Das, A. A survey of hallucination in large foundation models. arXiv:2309.05922, 2023.
- [61] Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. M. Steering llama 2 via contrastive activation addition. arXiv:2312.06681, 2023.
- [62] Roberts, A., Raffel, C., and Shazeer, N. How much knowledge can you pack into the parameters of a language model? EMNLP, 2020.
- [63] Shen, Y., Heacock, L., Elias, J., Hentel, K. D., Reig, B., Shih, G., and Moy, L. ChatGPT and other large language models are double-edged swords. Radiology, 2023.
- [64] Shi, W., Han, X., Lewis, M., Tsvetkov, Y., Zettlemoyer, L., and Yih, S. W.-t. Trusting your evidence: Hallucinate less with context-aware decoding. NAACL, 2023.
- [65] Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., Payne, P., Seneviratne, M., Gamble, P., Kelly, C., Scharli, N., Chowdhery, A., Mansfield, P., y Arcas, B. A., Webster, D., Corrado, G. S., Matias, Y., Chou, K., Gottweis, J., Tomasev, N., Liu, Y., Rajkomar, A., Barral, J., Semturs, ...
- [66] Su, D., Li, X., Zhang, J., Shang, L., Jiang, X., Liu, Q., and Fung, P. Read before generate! faithful long form question answering with machine reading. ACL, 2022.
- [67] Subramani, N., Suresh, N., and Peters, M. E. Extracting latent steering vectors from pretrained language models. ACL, 2022.
- [68] Team, T. G. Gemini: a family of highly capable multimodal models. 2023.
- [69] Tian, K., Mitchell, E., Yao, H., Manning, C. D., and Finn, C. Fine-tuning language models for factuality. ICLR, 2024.
- [71] URL https://arxiv.org/abs/2302.13971
- [72] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023.
- [73] Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M. R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., Almirantis, Y., Pavlopoulos, J., Baskiotis, N., Gallinari, P., Artiéres, T., Ngomo, A.-C. N., Heino, N., Gaussier, E., Barrio-Alvers, L., Schroeder, M., Androutsopoulos, I., and Paliouras, G. An ove...
- [74] Turpin, M., Michael, J., Perez, E., and Bowman, S. Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. NeurIPS, 2023.
- [75] Varshney, N., Yao, W., Zhang, H., Chen, J., and Yu, D. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by actively validating low-confidence generation. arXiv:2307.03987, 2023.
- [76] Wang, A., Cho, K., and Lewis, M. Asking and answering questions to evaluate the factual consistency of summaries. ACL, 2020.
- [77] Weiser, B. Lawyer who used ChatGPT faces penalty for made up citations. The New York Times, June 2023.
- [78]
- [79] Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al. Siren’s song in the ai ocean: a survey on hallucination in large language models. arXiv:2309.01219, 2023.
- [80] Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to ai transparency. arXiv:2310.01405, 2023.
A Additional Results
Model Task Accuracies. We report the accuracies achieved by the models on the various datasets used in this work in Table 3...