Principled Detection of Hallucinations in Large Language Models via Multiple Testing

Akshayaa Magesh; Jiawei Li; Venugopal V. Veeravalli

arxiv: 2508.18473 · v3 · submitted 2025-08-25 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-18 20:41 UTC · model grok-4.3

keywords hallucination detection, large language models, multiple testing, conformal p-values, hypothesis testing, out-of-distribution detection, calibrated detection

The pith

A multiple-testing method aggregates evaluation scores into conformal p-values to detect LLM hallucinations while controlling the false alarm rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames hallucination detection as a hypothesis testing task, similar to spotting out-of-distribution inputs. It introduces a procedure that converts several existing scoring rules into p-values and combines them through conformal calibration. This produces a detector that maintains a user-specified false alarm rate across different models and data sets. A reader would care because prior detectors rely on single empirical scores whose reliability shifts unpredictably, leaving practitioners without a principled way to trust or combine them.

Core claim

By casting hallucination detection as a multiple hypothesis testing problem and applying conformal p-values to aggregate diverse evaluation scores, the method yields a calibrated detector whose false alarm rate can be controlled at a pre-specified level without requiring distributional assumptions on the scores themselves.

What carries the argument

Conformal p-values that turn multiple evaluation scores into a single calibrated test statistic for the null hypothesis of no hallucination.
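
As a minimal sketch of what a conformal p-value looks like for a single score, the snippet below assumes that higher scores indicate stronger evidence of hallucination and that the calibration scores come from responses known not to be hallucinated. The function name and the example numbers are hypothetical, chosen only for illustration; the computation is the standard (1 + count) / (1 + n) form.

```python
def conformal_p_value(calibration_scores, test_score):
    """Conformal p-value of one test score against null calibration scores.

    Assumption: larger scores mean "more likely hallucinated", and every
    calibration score comes from a non-hallucinated response.
    """
    # Count calibration scores at least as extreme as the test score.
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + n_at_least) / (1 + len(calibration_scores))


# Hypothetical calibration scores, for illustration only.
calibration = [0.10, 0.20, 0.25, 0.40, 0.55, 0.70, 0.80, 0.90, 0.95]

print(conformal_p_value(calibration, 0.85))  # two scores >= 0.85, so 3/10
print(conformal_p_value(calibration, 0.99))  # no score >= 0.99, so 1/10
```

With n calibration points the smallest attainable p-value is 1 / (1 + n), so the calibration set has to be large enough for the target false alarm level to be reachable at all.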

If this is right

Detection decisions become reliable enough to set explicit error budgets in production systems.
Any new scoring rule can be added to the pool without redesigning the overall procedure.
The same framework applies to other generative tasks that admit multiple cheap evaluation scores.
Performance remains stable when the underlying LLM or the input domain changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be extended to streaming settings where responses arrive sequentially and calibration is updated on the fly.
It offers a route to combine human feedback signals with automatic scores under a single error-control guarantee.
Similar conformal aggregation might improve detection of other model failures such as factual inconsistency or unsafe outputs.

Load-bearing premise

Each individual evaluation score can be turned into a valid p-value under the assumption of no hallucination, and the conformal procedure works directly on LLM outputs.

What would settle it

A set of held-out responses where the method's observed false positive rate exceeds the target level across several models and data sets would show the calibration has failed.
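
The check described above can be sketched as a small simulation. The scores here are synthetic uniform draws standing in for real detector scores on non-hallucinated responses, so this illustrates only the mechanics of the check, not any result from the paper. The function names, calibration size, and trial count are assumptions made for the example.

```python
import random


def conformal_p_value(calibration_scores, test_score):
    # Standard conformal p-value; larger scores are treated as more suspicious.
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + n_at_least) / (1 + len(calibration_scores))


def empirical_false_alarm_rate(alpha, n_calibration=199, n_trials=2000, seed=0):
    """Fraction of null test points flagged at level alpha.

    Assumption: calibration and test scores are drawn i.i.d. from the same
    distribution, which is one way to satisfy exchangeability.
    """
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(n_trials):
        calibration = [rng.random() for _ in range(n_calibration)]
        null_test_score = rng.random()  # a non-hallucinated test response
        if conformal_p_value(calibration, null_test_score) <= alpha:
            false_alarms += 1
    return false_alarms / n_trials


for alpha in (0.05, 0.10):
    print(alpha, empirical_false_alarm_rate(alpha))
```

On real data the same loop would use held-out non-hallucinated responses in place of the synthetic draws; an observed rate well above alpha across models and data sets would be the failure signal.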

Original abstract

While Large Language Models (LLMs) have emerged as powerful foundational models to solve a variety of tasks, they have also been shown to be prone to hallucinations, i.e., generating responses that sound confident but are actually incorrect or even nonsensical. Existing hallucination detectors propose a wide range of empirical scoring rules, but their performance varies across models and datasets, and it is hard to determine which ones to rely on in practice or to treat as a reliable detector. In this work, we formulate the problem of detecting hallucinations as a hypothesis testing problem and draw parallels with the problem of out-of-distribution detection in machine learning models. We then propose a multiple-testing-inspired method that systematically aggregates multiple evaluation scores via conformal p-values, enabling calibrated detection with controlled false alarm rate. Extensive experiments across diverse models and datasets validate the robustness of our approach against state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

They frame hallucination detection as multiple testing and aggregate scores via conformal p-values for false-alarm control, but the key assumptions on score uniformity for LLM outputs are not clearly verified.

The main point is that this paper treats hallucination detection as a hypothesis testing problem and uses conformal p-values to combine several existing scores while aiming for a controlled false positive rate. That is the concrete advance over just running another empirical detector and hoping it works on the next model or dataset. They draw the parallel to out-of-distribution detection and then show experiments across multiple LLMs and benchmarks where the aggregated procedure looks more stable than the individual baselines. That part is useful for anyone who has to pick which score to trust in practice. The experiments give some evidence that the method is not brittle to model or data choice, which is worth having on record.

The soft spot is the validity of the p-values themselves. Conformal methods need the scores to produce roughly uniform p-values under the null of no hallucination and the calibration and test points to be exchangeable. LLM generations from the same model can introduce dependence through shared token statistics, and the null examples are still model outputs rather than external references. The abstract does not mention any direct check such as p-value histograms on held-out non-hallucinated responses, so it is not obvious that the guarantee transfers cleanly. If the full paper has those diagnostics, the concern shrinks; otherwise it stays material.

This work is aimed at people building or auditing reliable LLM systems who already know the usual scoring tricks and want a statistical wrapper around them. A reader working on calibration or trustworthy generation would get practical value. It is solid enough to deserve a serious referee rather than a desk reject, mainly because the idea is straightforward to implement and the empirical scope is reasonable. I would send it out for review but flag the need to see explicit uniformity or exchangeability checks in the experiments.

Referee Report

2 major / 2 minor

Summary. The paper frames hallucination detection in LLMs as a hypothesis-testing problem analogous to out-of-distribution detection. It introduces a multiple-testing procedure that converts multiple evaluation scores into conformal p-values and aggregates them to produce a detector with explicit control over the false-alarm rate. The approach is claimed to be robust across models and datasets, outperforming existing empirical scoring rules.
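
To make the "convert, then aggregate" pipeline in the summary concrete, here is a hedged sketch. The review does not state which aggregation rule the paper uses, so the Bonferroni rule below (flag when the smallest of K p-values is at most alpha / K) is a familiar stand-in and should not be read as the authors' procedure. All names and numbers are hypothetical.

```python
def conformal_p_value(calibration_scores, test_score):
    # Standard conformal p-value; larger scores are treated as more suspicious.
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + n_at_least) / (1 + len(calibration_scores))


def flag_hallucination(calibration_by_score, test_scores, alpha):
    """Combine K per-score conformal p-values with a Bonferroni rule.

    Assumption: Bonferroni is used purely for illustration; it controls the
    false alarm rate at alpha under arbitrary dependence between the scores.
    """
    p_values = [
        conformal_p_value(cal, t)
        for cal, t in zip(calibration_by_score, test_scores)
    ]
    flagged = min(p_values) <= alpha / len(p_values)
    return flagged, p_values


# Two hypothetical scoring rules, each with 39 null calibration scores.
calibration_by_score = [[i / 40 for i in range(39)] for _ in range(2)]

print(flag_hallucination(calibration_by_score, [0.975, 0.5], alpha=0.10))
# -> (True, [0.025, 0.5])
print(flag_hallucination(calibration_by_score, [0.5, 0.5], alpha=0.10))
# -> (False, [0.5, 0.5])
```

Bonferroni is conservative when the scores are strongly correlated; a less conservative combination rule would trade that slack for detection power, at the cost of needing more care about dependence.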

Significance. A method that supplies finite-sample false-alarm guarantees via conformal calibration would be a useful advance over ad-hoc detectors, provided the exchangeability and uniformity assumptions hold for LLM outputs. The absence of parameter-free derivations or machine-checked proofs limits the immediate theoretical impact, but successful validation could still influence practical deployment of reliable LLM systems.

major comments (2)

[Method section (p-value construction)] The central guarantee rests on converting evaluation scores into marginally uniform p-values under the null of 'no hallucination' and on exchangeability between calibration and test points. The manuscript does not report a uniformity diagnostic (e.g., QQ-plot or Kolmogorov-Smirnov test) on held-out non-hallucinated generations, leaving open whether dependence induced by shared token distributions violates the conformal coverage guarantee.
[Experimental results] Table or figure reporting empirical false-alarm rates: the experiments claim controlled false-alarm rates, yet without an explicit comparison of observed versus nominal rates on a held-out non-hallucinated test set, it is impossible to verify that the multiple-testing aggregation actually delivers the advertised calibration.
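
The uniformity diagnostic requested in the first major comment could look roughly like the sketch below, which computes the one-sample Kolmogorov-Smirnov statistic of a set of p-values against Uniform(0, 1). It is written with the standard library only; the function name and inputs are hypothetical, and the 1.36 / sqrt(n) threshold is the usual large-sample approximation at the 5% level. Conformal p-values are discrete, so the comparison against a continuous uniform is approximate.

```python
import math


def ks_statistic_uniform(p_values):
    """KS distance between the empirical CDF of p_values and Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    # Largest gap with the empirical CDF above the uniform CDF ...
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    # ... and with the empirical CDF below it.
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)


# Hypothetical p-values on held-out non-hallucinated responses.
well_spread = [(i + 0.5) / 10 for i in range(10)]  # evenly spaced
clustered = [0.01] * 10                            # piled up near zero

for name, ps in (("well_spread", well_spread), ("clustered", clustered)):
    d = ks_statistic_uniform(ps)
    threshold = 1.36 / math.sqrt(len(ps))  # approximate 5% critical value
    print(name, round(d, 3), "suspicious" if d > threshold else "ok")
```

A large statistic caused by p-values piling up near zero is the worrying direction, since that inflates false alarms; p-values that skew large only make the detector conservative.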

minor comments (2)

[Method] Notation for the aggregated conformal p-value and the multiple-testing threshold should be introduced with a single, self-contained equation rather than scattered across paragraphs.
[Abstract and results] The abstract states that performance 'varies across models and datasets' but does not quantify the improvement (e.g., average F1 or false-alarm reduction) relative to the strongest baseline; adding one summary row to the main results table would help readers assess practical gain.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and the focus on the validity of the conformal guarantees. We respond to each major comment below and have revised the manuscript accordingly.

Referee: [Method section (p-value construction)] The central guarantee rests on converting evaluation scores into marginally uniform p-values under the null of 'no hallucination' and on exchangeability between calibration and test points. The manuscript does not report a uniformity diagnostic (e.g., QQ-plot or Kolmogorov-Smirnov test) on held-out non-hallucinated generations, leaving open whether dependence induced by shared token distributions violates the conformal coverage guarantee.

Authors: We agree that a uniformity diagnostic strengthens the exposition. In the revised manuscript we include QQ-plots together with Kolmogorov-Smirnov test statistics computed on held-out non-hallucinated generations for each dataset and model. These checks support approximate marginal uniformity. On the exchangeability question, the conformal coverage guarantee requires only that the calibration and test scores are exchangeable under the null; our protocol draws independent generations from the identical model and prompt distribution and uses disjoint calibration and test splits, thereby preserving the required exchangeability. Dependence induced by shared token statistics is therefore already accounted for by the exchangeability assumption.
revision: yes
Referee: [Experimental results] Table or figure reporting empirical false-alarm rates: the experiments claim controlled false-alarm rates, yet without an explicit comparison of observed versus nominal rates on a held-out non-hallucinated test set, it is impossible to verify that the multiple-testing aggregation actually delivers the advertised calibration.

Authors: We accept the suggestion. The revised manuscript now contains a dedicated figure that plots observed false-alarm rates against the nominal levels (0.05 and 0.10) on held-out non-hallucinated test sets across all evaluated LLMs and datasets. The empirical rates track the nominal thresholds closely, confirming that the multiple-testing aggregation maintains the advertised calibration.
revision: yes

Circularity Check

0 steps flagged

No circularity: method applies external conformal framework to new task

The paper formulates hallucination detection as hypothesis testing and applies conformal p-values to aggregate existing evaluation scores. This is an application of established multiple-testing and conformal prediction techniques rather than a self-referential derivation. No equations reduce a claimed prediction to a fitted input by construction, no load-bearing self-citation chain is invoked for uniqueness, and the central aggregation procedure introduces independent content validated on external benchmarks. The derivation remains self-contained against the stated assumptions without tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; full text would be required to enumerate free parameters, axioms, or invented entities. No explicit free parameters or new entities are named in the provided abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We propose a multiple-testing-inspired method that systematically aggregates multiple evaluation scores via conformal p-values, enabling calibrated detection with controlled false alarm rate."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "the conformal p-values ... q_j^con := (1 + |{i : s_j^i >= t_j^test}|) / (1 + |C|)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
