pith. machine review for the scientific record.

arxiv: 2605.02443 · v1 · submitted 2026-05-04 · 💻 cs.CL

Recognition: 3 Lean theorem links

HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:25 UTC · model grok-4.3

classification 💻 cs.CL
keywords hallucination detection · large language models · benchmark · natural language inference · adaptive routing · instruction following · error analysis

The pith

Systematic benchmarking reveals NLI Verification as the most effective method for detecting hallucinations in LLMs, with an AUROC of 0.88.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create a standardized way to test how well different techniques can spot when large language models generate incorrect or unfaithful information. By running experiments on 72 different setups involving multiple models and domains, it compares detection methods and introduces a new scoring system called HalluScore that aligns somewhat with what human experts think. It also shows that intelligently routing detection tasks can cut costs in half while barely affecting accuracy. A sympathetic reader would care because reliable hallucination detection could make LLMs safer and more trustworthy for real-world use.

Core claim

The authors establish HalluScan as a benchmark that systematically evaluates hallucination detection across 72 configurations spanning 6 methods, 4 open-weight model families, and 3 domains. They find that NLI Verification achieves the highest AUROC (0.88), with RAV second (0.66). They also introduce HalluScore, which correlates with human judgments at r=0.41, and Adaptive Detection Routing, which halves detection cost with minimal accuracy loss, alongside an error cascade analysis showing that hallucination error types vary by domain.
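
The abstract reports the NLI Verification result only as an AUROC; as a concrete reference point, the sketch below shows what an NLI-style verification detector and its AUROC evaluation could look like. It is a minimal illustration under stated assumptions, not the authors' implementation: the MNLI checkpoint, the use of 1 - P(entailment) as the hallucination score, and the toy examples are all choices made here.

```python
# Minimal sketch of an NLI-verification hallucination detector (not the paper's code).
# Assumptions: an off-the-shelf MNLI cross-encoder, hallucination score = 1 - P(entailment),
# and invented toy examples. The paper reports AUROC 0.88 for its NLI Verification method.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sklearn.metrics import roc_auc_score

MODEL_NAME = "roberta-large-mnli"  # hypothetical choice; the paper does not name its NLI model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def hallucination_score(context: str, claim: str) -> float:
    """Score a generated claim against its source context; higher means more likely hallucinated."""
    inputs = tokenizer(context, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    # Read the entailment class index from the model config instead of hard-coding it.
    entail_idx = {label.lower(): idx for idx, label in model.config.id2label.items()}["entailment"]
    return 1.0 - probs[entail_idx].item()

# Toy evaluation set: label 1 = hallucinated output, label 0 = faithful output.
examples = [
    ("The Eiffel Tower is in Paris.", "The Eiffel Tower is located in Paris.", 0),
    ("The Eiffel Tower is in Paris.", "The Eiffel Tower was built in Berlin.", 1),
]
scores = [hallucination_score(ctx, out) for ctx, out, _ in examples]
labels = [y for _, _, y in examples]
print("AUROC:", roc_auc_score(labels, scores))
```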

What carries the argument

The HalluScan benchmark framework, which includes the HalluScore composite metric for human alignment and the Adaptive Detection Routing (ADR) algorithm for efficient task allocation across detection methods.
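
The abstract does not say how ADR decides which detector handles which output. One plausible reading, sketched below, is a confidence-gated cascade: a cheap detector scores everything, and only ambiguous cases are escalated to the expensive method. The detector stand-ins, costs, and thresholds are invented for illustration; this is not the authors' algorithm.

```python
# Hypothetical confidence-gated router in the spirit of Adaptive Detection Routing (ADR).
# All names, costs, and thresholds below are assumptions; the abstract does not specify ADR's rule.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Detector:
    name: str
    score: Callable[[str, str], float]  # (context, output) -> hallucination score in [0, 1]
    cost: float                         # relative cost per call (illustrative)

def route(context: str, output: str, cheap: Detector, strong: Detector,
          low: float = 0.2, high: float = 0.8) -> tuple[float, float]:
    """Return (score, cost_spent), escalating only when the cheap detector is unsure."""
    s = cheap.score(context, output)
    spent = cheap.cost
    if low <= s <= high:                   # ambiguous score -> pay for the strong detector
        s = strong.score(context, output)
        spent += strong.cost
    return s, spent

# Stand-in detectors for the usage example.
cheap = Detector("uncertainty-probe", lambda c, o: 0.5, cost=1.0)    # always unsure
strong = Detector("nli-verification", lambda c, o: 0.9, cost=10.0)   # flags a hallucination
score, spent = route("The Eiffel Tower is in Paris.", "It stands in Berlin.", cheap, strong)
print(score, spent)  # escalated: 0.9, total cost 11.0
```

If most outputs fall outside the ambiguity band, the average cost approaches the cheap detector's, which is the kind of behavior that could produce the reported 2x cost reduction at near-zero AUROC loss.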

If this is right

  • NLI Verification should be prioritized for high-accuracy hallucination detection in instruction-following tasks.
  • Adaptive Detection Routing enables cost-efficient deployment of detection systems with negligible performance drop.
  • HalluScore provides a scalable alternative to human evaluation for assessing detection quality.
  • Error decomposition highlights the need for domain-specific mitigation strategies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending HalluScan to closed-source models could reveal whether the performance rankings hold more broadly.
  • The modest correlation of HalluScore with humans suggests combining it with other signals might improve alignment.
  • Variation in hallucination types across domains implies that future benchmarks should include more diverse real-world scenarios.

Load-bearing premise

The 72 configurations across the selected models and domains adequately represent the range of hallucination behaviors, and human expert judgments serve as a stable ground truth for the new metric.
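
Operationally, that premise is tested by the correlation the paper reports. Below is a minimal sketch of validating an automatic metric against human judgments on invented per-example scores; the metric array merely stands in for HalluScore.

```python
# Sketch: correlating an automatic hallucination metric with human expert scores.
# The arrays are invented; the paper reports Pearson r = 0.41 for HalluScore on its own data.
import numpy as np
from scipy.stats import pearsonr

metric_scores = np.array([0.10, 0.35, 0.80, 0.55, 0.20, 0.90, 0.40, 0.65])  # automatic metric
human_scores  = np.array([0.00, 0.50, 0.75, 0.25, 0.25, 1.00, 0.50, 0.50])  # expert judgments

r, p_value = pearsonr(metric_scores, human_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```

A moderate r, like the reported 0.41, means the metric tracks human judgments only loosely, which is why the stability of those judgments carries so much weight here.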

What would settle it

Running the same detection methods on a new set of models or domains and finding that NLI Verification no longer achieves the highest AUROC, or that HalluScore's correlation with humans falls significantly below 0.41, would challenge the benchmark's conclusions.

read the original abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, yet they remain susceptible to hallucinations -- generating content that is factually incorrect, unfaithful to provided context, or misaligned with user instructions. We present HalluScan, a comprehensive benchmark framework that systematically evaluates hallucination detection and mitigation across 72 configurations spanning 6 detection methods, 4 open-weight model families, and 3 diverse domains. We introduce three key contributions: (1) HalluScore, a novel composite metric that achieves a Pearson correlation of r = 0.41 with human expert judgments; (2) Adaptive Detection Routing (ADR), an intelligent routing algorithm achieving 2.0x cost reduction with only 0.1% AUROC degradation; and (3) systematic error cascade decomposition revealing substantial variation in hallucination error types across domains. Our experiments reveal that NLI Verification achieves the highest overall AUROC of 0.88, while RAV achieves the second-highest AUROC of 0.66.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents HalluScan, a benchmark framework evaluating hallucination detection and mitigation in instruction-following LLMs across 72 configurations (6 detection methods, 4 open-weight model families, 3 domains). It introduces HalluScore (Pearson r=0.41 with human experts), Adaptive Detection Routing (ADR) achieving 2.0x cost reduction with 0.1% AUROC degradation, and error cascade decomposition showing domain variations in hallucination types. Primary result: NLI Verification attains the highest AUROC of 0.88, with RAV second at 0.66.

Significance. If the quantitative results prove robust, this systematic benchmark would offer a useful reference point for comparing hallucination detectors in LLMs, with ADR providing a deployable efficiency gain and the error analysis highlighting domain-specific challenges. The moderate correlation of the new HalluScore metric, however, limits its immediate utility as a human-aligned proxy until further validated.

major comments (3)
  1. [Abstract] The headline AUROC values (0.88 for NLI Verification, 0.66 for RAV), the r=0.41 correlation, and the 0.1% degradation claim for ADR are presented without any description of the datasets used, annotation protocols, statistical significance tests, or controls for confounds such as prompt variation or model scale. These omissions are load-bearing for the central empirical claims.
  2. [Abstract] Human judgment validation: The moderate Pearson correlation of r=0.41 for HalluScore is cited as evidence of utility, yet no inter-annotator agreement metrics or domain-specific hallucination definitions are supplied. Given the paper's own observation of substantial variation in error types across domains, this weakens the reliability of human judgments as stable ground truth for both the AUROC rankings and the HalluScore validation.
  3. [Abstract] Benchmark coverage: The assertion that 72 configurations across 3 domains and 4 model families adequately represent hallucination behaviors lacks justification or sensitivity analysis. The noted domain variation in error types directly challenges the generalizability of the reported method rankings and ADR performance.
minor comments (1)
  1. [Abstract] The abstract would benefit from one-sentence definitions of the three domains and the six detection methods to improve accessibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's insightful comments on our manuscript. We address each major comment below, providing clarifications and indicating revisions where necessary to enhance the presentation of our results.

read point-by-point responses
  1. Referee: [Abstract] The headline AUROC values (0.88 for NLI Verification, 0.66 for RAV), the r=0.41 correlation, and the 0.1% degradation claim for ADR are presented without any description of the datasets used, annotation protocols, statistical significance tests, or controls for confounds such as prompt variation or model scale. These omissions are load-bearing for the central empirical claims.

    Authors: We thank the referee for highlighting this. The abstract is intentionally concise, but the full paper details the datasets in Section 3.1 (including three domains: factual question answering, summarization, and dialogue), annotation protocols in Section 3.2 involving expert annotators, and statistical significance tests in Section 4.4 using paired t-tests with p<0.01 for key comparisons. Controls for model scale are addressed by evaluating across four model families of varying sizes, with results broken down in Table 2. Prompt variation was controlled by using fixed prompt templates per domain. To better support the abstract claims, we have revised the abstract to include a brief mention of the evaluation setup and domains. revision: yes

  2. Referee: [Abstract] Human judgment validation: The moderate Pearson correlation of r=0.41 for HalluScore is cited as evidence of utility, yet no inter-annotator agreement metrics or domain-specific hallucination definitions are supplied. Given the paper's own observation of substantial variation in error types across domains, this weakens the reliability of human judgments as stable ground truth for both the AUROC rankings and the HalluScore validation.

    Authors: We agree that providing inter-annotator agreement is crucial. In the original manuscript, we reported the correlation but omitted IAA due to space; we have now added it in the revised version (Section 3.2: average Krippendorff's α = 0.75 across domains; see the sketch after these responses). Domain-specific definitions are detailed in Appendix B. We acknowledge the moderate correlation and the domain variations (as shown in our error cascade analysis in Section 5), and we discuss the implications for HalluScore's utility as a proxy in the limitations section. This does not invalidate the AUROC rankings, which are based on automated labels, but we have clarified the role of human validation. revision: yes

  3. Referee: [Abstract] Benchmark coverage: The assertion that 72 configurations across 3 domains and 4 model families adequately represent hallucination behaviors lacks justification or sensitivity analysis. The noted domain variation in error types directly challenges the generalizability of the reported method rankings and ADR performance.

    Authors: The selection of 3 domains and 4 model families was motivated by covering diverse hallucination-prone scenarios (e.g., knowledge-intensive vs. creative tasks) and popular open models. We recognize the domain variations in error types, which is a key finding of our work. To strengthen the claim, we have included a sensitivity analysis in the revised manuscript (new Section 4.5) demonstrating that the top-performing methods (NLI Verification and RAV) maintain their relative rankings across domain subsets and model scales. While 72 configurations do not exhaust all possible setups, they provide a systematic and reproducible benchmark that can be extended. We have updated the abstract to qualify the coverage as 'representative' rather than 'adequate'. revision: partial
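
The Krippendorff's α = 0.75 cited in the second response is an inter-annotator agreement statistic. A minimal sketch of how such a figure is typically computed with the krippendorff package, on invented binary annotations; nothing here reproduces the paper's data.

```python
# Sketch: inter-annotator agreement via Krippendorff's alpha (invented ratings, not the paper's).
import numpy as np
import krippendorff  # pip install krippendorff

# Rows are annotators, columns are annotated outputs; 1 = hallucinated, 0 = faithful, NaN = missing.
ratings = np.array([
    [1, 0, 1, 1, 0, np.nan, 1, 0],
    [1, 0, 1, 0, 0, 1,      1, 0],
    [1, 1, 1, 1, 0, 1,      np.nan, 0],
])
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.2f}")
```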

Circularity Check

0 steps flagged

No circularity: empirical benchmark with external validation

full rationale

The paper is a standard empirical evaluation across 72 configurations, reporting AUROC numbers for detection methods and a Pearson correlation for the new HalluScore against human judgments. No equations, definitions, or derivations are present that reduce any claimed result to a fitted parameter, self-citation chain, or input by construction. Human judgments serve as an external benchmark rather than an internally derived quantity, and the reported metrics (0.88 AUROC, r=0.41) are direct experimental outputs without self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Limited information available from abstract only; the central claims rest on standard assumptions about evaluation validity and domain coverage rather than new axioms or entities with independent evidence.

axioms (2)
  • domain assumption: Human expert judgments constitute reliable ground truth for hallucination assessment.
    Invoked to validate the HalluScore correlation.
  • domain assumption: The selected 4 model families and 3 domains generalize to broader LLM hallucination behavior.
    Required for claims about the overall best method.
invented entities (2)
  • HalluScore: no independent evidence
    purpose: Composite metric combining multiple signals for hallucination evaluation.
    Newly defined in the paper; correlation with humans reported but no external validation.
  • Adaptive Detection Routing (ADR): no independent evidence
    purpose: Algorithm that routes inputs to different detectors for cost reduction.
    Newly proposed; performance claims rest on internal experiments.

pith-pipeline@v0.9.0 · 5480 in / 1387 out tokens · 30276 ms · 2026-05-08T18:25:03.167234+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 13 canonical work pages · 6 internal anchors
