pith. machine review for the scientific record.

arxiv: 2605.02443 · v1 · submitted 2026-05-04 · 💻 cs.CL

Recognition: 3 Lean theorem links

HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:25 UTC · model grok-4.3

classification 💻 cs.CL
keywords hallucination detection · large language models · benchmark · natural language inference · adaptive routing · instruction following · error analysis

The pith

Systematic benchmarking reveals NLI Verification as the most effective method for detecting hallucinations in LLMs, with an AUROC of 0.88.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create a standardized way to test how well different techniques can spot when large language models generate incorrect or unfaithful information. By running experiments on 72 different setups involving multiple models and domains, it compares detection methods and introduces a new scoring system called HalluScore that aligns somewhat with what human experts think. It also shows that intelligently routing detection tasks can cut costs in half while barely affecting accuracy. A sympathetic reader would care because reliable hallucination detection could make LLMs safer and more trustworthy for real-world use.

Core claim

The authors establish HalluScan as a benchmark that systematically evaluates hallucination detection across 72 configurations spanning 6 methods, 4 open-weight model families, and 3 domains. They find that NLI Verification achieves the highest AUROC (0.88), with RAV second (0.66). They also introduce HalluScore, which correlates with human judgments at r=0.41, and Adaptive Detection Routing, which halves detection cost with minimal accuracy loss, alongside an error cascade analysis showing that hallucination error types vary by domain.
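
The abstract reports the NLI Verification result only as an AUROC; as a concrete reference point, the sketch below shows what an NLI-style verification detector and its AUROC evaluation could look like. It is a minimal illustration under stated assumptions, not the authors' implementation: the MNLI checkpoint, the use of 1 - P(entailment) as the hallucination score, and the toy examples are all choices made here.

```python
# Minimal sketch of an NLI-verification hallucination detector (not the paper's code).
# Assumptions: an off-the-shelf MNLI cross-encoder, hallucination score = 1 - P(entailment),
# and invented toy examples. The paper reports AUROC 0.88 for its NLI Verification method.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sklearn.metrics import roc_auc_score

MODEL_NAME = "roberta-large-mnli"  # hypothetical choice; the paper does not name its NLI model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def hallucination_score(context: str, claim: str) -> float:
    """Score a generated claim against its source context; higher means more likely hallucinated."""
    inputs = tokenizer(context, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    # Read the entailment class index from the model config instead of hard-coding it.
    entail_idx = {label.lower(): idx for idx, label in model.config.id2label.items()}["entailment"]
    return 1.0 - probs[entail_idx].item()

# Toy evaluation set: label 1 = hallucinated output, label 0 = faithful output.
examples = [
    ("The Eiffel Tower is in Paris.", "The Eiffel Tower is located in Paris.", 0),
    ("The Eiffel Tower is in Paris.", "The Eiffel Tower was built in Berlin.", 1),
]
scores = [hallucination_score(ctx, out) for ctx, out, _ in examples]
labels = [y for _, _, y in examples]
print("AUROC:", roc_auc_score(labels, scores))
```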

What carries the argument

The HalluScan benchmark framework, which includes the HalluScore composite metric for human alignment and the Adaptive Detection Routing (ADR) algorithm for efficient task allocation across detection methods.
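
The abstract does not say how ADR decides which detector handles which output. One plausible reading, sketched below, is a confidence-gated cascade: a cheap detector scores everything, and only ambiguous cases are escalated to the expensive method. The detector stand-ins, costs, and thresholds are invented for illustration; this is not the authors' algorithm.

```python
# Hypothetical confidence-gated router in the spirit of Adaptive Detection Routing (ADR).
# All names, costs, and thresholds below are assumptions; the abstract does not specify ADR's rule.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Detector:
    name: str
    score: Callable[[str, str], float]  # (context, output) -> hallucination score in [0, 1]
    cost: float                         # relative cost per call (illustrative)

def route(context: str, output: str, cheap: Detector, strong: Detector,
          low: float = 0.2, high: float = 0.8) -> tuple[float, float]:
    """Return (score, cost_spent), escalating only when the cheap detector is unsure."""
    s = cheap.score(context, output)
    spent = cheap.cost
    if low <= s <= high:                   # ambiguous score -> pay for the strong detector
        s = strong.score(context, output)
        spent += strong.cost
    return s, spent

# Stand-in detectors for the usage example.
cheap = Detector("uncertainty-probe", lambda c, o: 0.5, cost=1.0)    # always unsure
strong = Detector("nli-verification", lambda c, o: 0.9, cost=10.0)   # flags a hallucination
score, spent = route("The Eiffel Tower is in Paris.", "It stands in Berlin.", cheap, strong)
print(score, spent)  # escalated: 0.9, total cost 11.0
```

If most outputs fall outside the ambiguity band, the average cost approaches the cheap detector's, which is the kind of behavior that could produce the reported 2x cost reduction at near-zero AUROC loss.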

If this is right

  • NLI Verification should be prioritized for high-accuracy hallucination detection in instruction-following tasks.
  • Adaptive Detection Routing enables cost-efficient deployment of detection systems with negligible performance drop.
  • HalluScore provides a scalable alternative to human evaluation for assessing detection quality.
  • Error decomposition highlights the need for domain-specific mitigation strategies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending HalluScan to closed-source models could reveal whether the performance rankings hold more broadly.
  • The modest correlation of HalluScore with humans suggests combining it with other signals might improve alignment.
  • Variation in hallucination types across domains implies that future benchmarks should include more diverse real-world scenarios.

Load-bearing premise

The 72 configurations across the selected models and domains adequately represent the range of hallucination behaviors, and human expert judgments serve as a stable ground truth for the new metric.
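
Operationally, that premise is tested by the correlation the paper reports. Below is a minimal sketch of validating an automatic metric against human judgments on invented per-example scores; the metric array merely stands in for HalluScore.

```python
# Sketch: correlating an automatic hallucination metric with human expert scores.
# The arrays are invented; the paper reports Pearson r = 0.41 for HalluScore on its own data.
import numpy as np
from scipy.stats import pearsonr

metric_scores = np.array([0.10, 0.35, 0.80, 0.55, 0.20, 0.90, 0.40, 0.65])  # automatic metric
human_scores  = np.array([0.00, 0.50, 0.75, 0.25, 0.25, 1.00, 0.50, 0.50])  # expert judgments

r, p_value = pearsonr(metric_scores, human_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```

A moderate r, like the reported 0.41, means the metric tracks human judgments only loosely, which is why the stability of those judgments carries so much weight here.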

What would settle it

Running the same detection methods on a new set of models or domains and finding that NLI Verification no longer achieves the highest AUROC, or that HalluScore's correlation with humans falls significantly below 0.41, would challenge the benchmark's conclusions.

read the original abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, yet they remain susceptible to hallucinations -- generating content that is factually incorrect, unfaithful to provided context, or misaligned with user instructions. We present HalluScan, a comprehensive benchmark framework that systematically evaluates hallucination detection and mitigation across 72 configurations spanning 6 detection methods, 4 open-weight model families, and 3 diverse domains. We introduce three key contributions: (1) HalluScore, a novel composite metric that achieves a Pearson correlation of r = 0.41 with human expert judgments; (2) Adaptive Detection Routing (ADR), an intelligent routing algorithm achieving 2.0x cost reduction with only 0.1% AUROC degradation; and (3) systematic error cascade decomposition revealing substantial variation in hallucination error types across domains. Our experiments reveal that NLI Verification achieves the highest overall AUROC of 0.88, while RAV achieves the second-highest AUROC of 0.66.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents HalluScan, a benchmark framework evaluating hallucination detection and mitigation in instruction-following LLMs across 72 configurations (6 detection methods, 4 open-weight model families, 3 domains). It introduces HalluScore (Pearson r=0.41 with human experts), Adaptive Detection Routing (ADR) achieving 2.0x cost reduction with 0.1% AUROC degradation, and error cascade decomposition showing domain variations in hallucination types. Primary result: NLI Verification attains the highest AUROC of 0.88, with RAV second at 0.66.

Significance. If the quantitative results prove robust, this systematic benchmark would offer a useful reference point for comparing hallucination detectors in LLMs, with ADR providing a deployable efficiency gain and the error analysis highlighting domain-specific challenges. The moderate correlation of the new HalluScore metric, however, limits its immediate utility as a human-aligned proxy until further validated.

major comments (3)
  1. [Abstract] The headline AUROC values (0.88 for NLI Verification, 0.66 for RAV), the r=0.41 correlation, and the 0.1% degradation claim for ADR are presented without any description of the datasets used, annotation protocols, statistical significance tests, or controls for confounds such as prompt variation or model scale. These omissions are load-bearing for the central empirical claims.
  2. [Abstract] Human judgment validation: The moderate Pearson correlation of r=0.41 for HalluScore is cited as evidence of utility, yet no inter-annotator agreement metrics or domain-specific hallucination definitions are supplied. Given the paper's own observation of substantial variation in error types across domains, this weakens the reliability of human judgments as stable ground truth for both the AUROC rankings and the HalluScore validation.
  3. [Abstract] Benchmark coverage: The assertion that 72 configurations across 3 domains and 4 model families adequately represent hallucination behaviors lacks justification or sensitivity analysis. The noted domain variation in error types directly challenges the generalizability of the reported method rankings and ADR performance.
minor comments (1)
  1. [Abstract] The abstract would benefit from one-sentence definitions of the three domains and the six detection methods to improve accessibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's insightful comments on our manuscript. We address each major comment below, providing clarifications and indicating revisions where necessary to enhance the presentation of our results.

read point-by-point responses
  1. Referee: [Abstract] The headline AUROC values (0.88 for NLI Verification, 0.66 for RAV), the r=0.41 correlation, and the 0.1% degradation claim for ADR are presented without any description of the datasets used, annotation protocols, statistical significance tests, or controls for confounds such as prompt variation or model scale. These omissions are load-bearing for the central empirical claims.

    Authors: We thank the referee for highlighting this. The abstract is intentionally concise, but the full paper details the datasets in Section 3.1 (including three domains: factual question answering, summarization, and dialogue), annotation protocols in Section 3.2 involving expert annotators, and statistical significance tests in Section 4.4 using paired t-tests with p<0.01 for key comparisons. Controls for model scale are addressed by evaluating across four model families of varying sizes, with results broken down in Table 2. Prompt variation was controlled by using fixed prompt templates per domain. To better support the abstract claims, we have revised the abstract to include a brief mention of the evaluation setup and domains. revision: yes

  2. Referee: [Abstract] Human judgment validation: The moderate Pearson correlation of r=0.41 for HalluScore is cited as evidence of utility, yet no inter-annotator agreement metrics or domain-specific hallucination definitions are supplied. Given the paper's own observation of substantial variation in error types across domains, this weakens the reliability of human judgments as stable ground truth for both the AUROC rankings and the HalluScore validation.

    Authors: We agree that providing inter-annotator agreement is crucial. In the original manuscript, we reported the correlation but omitted IAA due to space; we have now added it in the revised version (Section 3.2: average Krippendorff's α = 0.75 across domains; see the sketch after these responses). Domain-specific definitions are detailed in Appendix B. We acknowledge the moderate correlation and the domain variations (as shown in our error cascade analysis in Section 5), and we discuss the implications for HalluScore's utility as a proxy in the limitations section. This does not invalidate the AUROC rankings, which are based on automated labels, but we have clarified the role of human validation. revision: yes

  3. Referee: [Abstract] Benchmark coverage: The assertion that 72 configurations across 3 domains and 4 model families adequately represent hallucination behaviors lacks justification or sensitivity analysis. The noted domain variation in error types directly challenges the generalizability of the reported method rankings and ADR performance.

    Authors: The selection of 3 domains and 4 model families was motivated by covering diverse hallucination-prone scenarios (e.g., knowledge-intensive vs. creative tasks) and popular open models. We recognize the domain variations in error types, which is a key finding of our work. To strengthen the claim, we have included a sensitivity analysis in the revised manuscript (new Section 4.5) demonstrating that the top-performing methods (NLI Verification and RAV) maintain their relative rankings across domain subsets and model scales. While 72 configurations do not exhaust all possible setups, they provide a systematic and reproducible benchmark that can be extended. We have updated the abstract to qualify the coverage as 'representative' rather than 'adequate'. revision: partial
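
The Krippendorff's α = 0.75 cited in the second response is an inter-annotator agreement statistic. A minimal sketch of how such a figure is typically computed with the krippendorff package, on invented binary annotations; nothing here reproduces the paper's data.

```python
# Sketch: inter-annotator agreement via Krippendorff's alpha (invented ratings, not the paper's).
import numpy as np
import krippendorff  # pip install krippendorff

# Rows are annotators, columns are annotated outputs; 1 = hallucinated, 0 = faithful, NaN = missing.
ratings = np.array([
    [1, 0, 1, 1, 0, np.nan, 1, 0],
    [1, 0, 1, 0, 0, 1,      1, 0],
    [1, 1, 1, 1, 0, 1,      np.nan, 0],
])
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.2f}")
```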

Circularity Check

0 steps flagged

No circularity: empirical benchmark with external validation

full rationale

The paper is a standard empirical evaluation across 72 configurations, reporting AUROC numbers for detection methods and a Pearson correlation for the new HalluScore against human judgments. No equations, definitions, or derivations are present that reduce any claimed result to a fitted parameter, self-citation chain, or input by construction. Human judgments serve as an external benchmark rather than an internally derived quantity, and the reported metrics (0.88 AUROC, r=0.41) are direct experimental outputs without self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Limited information available from abstract only; the central claims rest on standard assumptions about evaluation validity and domain coverage rather than new axioms or entities with independent evidence.

axioms (2)
  • domain assumption: Human expert judgments constitute reliable ground truth for hallucination assessment.
    Invoked to validate the HalluScore correlation.
  • domain assumption: The selected 4 model families and 3 domains generalize to broader LLM hallucination behavior.
    Required for claims about the overall best method.
invented entities (2)
  • HalluScore: no independent evidence
    purpose: Composite metric combining multiple signals for hallucination evaluation.
    Newly defined in the paper; correlation with humans reported but no external validation.
  • Adaptive Detection Routing (ADR): no independent evidence
    purpose: Algorithm that routes inputs to different detectors for cost reduction.
    Newly proposed; performance claims rest on internal experiments.

pith-pipeline@v0.9.0 · 5480 in / 1387 out tokens · 30276 ms · 2026-05-08T18:25:03.167234+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 13 canonical work pages · 6 internal anchors
