
arxiv: 2603.09986 · v2 · submitted 2026-02-12 · 💻 cs.CL · cs.AI

Quantifying Hallucinations in Language Language Models on Medical Textbooks

Pith reviewed 2026-05-16 02:23 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords hallucinations · large language models · medical question answering · textbook grounded QA · clinician evaluation · factuality assessment

The pith

LLaMA-70B-Instruct hallucinates in 19.7 percent of textbook-grounded medical answers even though 98.8 percent of its responses are rated maximally plausible.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper measures how often large language models produce factually unsupported claims when answering questions drawn directly from medical textbooks. LLaMA-70B-Instruct hallucinated in 19.7 percent of its answers (95 percent confidence interval 18.6 to 20.7), even though evaluators gave 98.8 percent of responses the highest possible plausibility score. Across several models, lower hallucination rates aligned with higher clinician usefulness ratings (ρ = -0.71, p = 0.058), a borderline result over a small set of models. The authors conclude that human expert oversight remains necessary because current models are not ready for unsupervised use in clinical settings.

Core claim

Using textbook passages as the fixed evidence source, LLaMA-70B-Instruct produced hallucinations in 19.7 percent of answers to medical QA prompts, while 98.8 percent of responses received a maximal plausibility rating. Across models, hallucination frequency correlated negatively with clinician usefulness scores (ρ = -0.71, p = 0.058). Rater agreement was high in experiment 1 (quadratic weighted κ = 0.92) and lower in experiment 2 (κ = 0.57 to 0.61, τ_b = 0.06 to 0.18). The work concludes that large language models remain unfit for unsupervised clinical deployment.
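To make the headline interval concrete, the following is a minimal sketch of one common way to put a 95 percent interval around a hallucination rate, the Wilson score interval. The paper's exact interval method and its answer counts are not given on this page, so the function name `wilson_interval` and the counts used below are illustrative assumptions only.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")
    p_hat = successes / total
    denom = 1 + z * z / total
    centre = (p_hat + z * z / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical counts, NOT the paper's data: 197 flagged answers out of 1,000.
flagged, answered = 197, 1000
low, high = wilson_interval(flagged, answered)
print(f"rate = {flagged / answered:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

The interval narrows roughly with the square root of the number of answers, so an interval as tight as the reported 18.6 to 20.7 implies a considerably larger evaluation set than the 1,000 answers assumed here.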

What carries the argument

Evaluation that flags hallucinations by comparing model answers against the exact textbook passages provided in the prompt, combined with clinician ratings of usefulness and plausibility.

If this is right

  • Models with fewer hallucinations receive higher usefulness ratings from clinicians.
  • High plausibility scores do not prevent factual errors in medical answers.
  • Human oversight is required and constitutes the main cost of safe deployment.
  • Current models across sizes and architectures are unsuitable for unsupervised clinical use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Fixed-evidence hallucination benchmarks could transfer to other domains that rely on authoritative source documents.
  • Lowering hallucination rates would directly reduce the human review burden in medical applications.
  • Performance on textbook-grounded tasks may not predict behavior when source material is incomplete or conflicting.

Load-bearing premise

The supplied textbook passages contain complete information sufficient to judge whether any claim in a model response is hallucinated, and that clinician ratings accurately reflect factual correctness.

What would settle it

A follow-up study in which clinicians receive the full original textbook chapters and must mark every unsupported claim in the model responses.

Figures

Figures reproduced from arXiv: 2603.09986 by Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman.

Figure 1: Model–QA type heat-map. Each cell shows the mean annotator rank (1 = best, 8 = worst).
Figure 3: Annotator bias matrix (weighted). Colour shows the weighted mean rank of each model for each annotator, where weights are the number of judgements that the annotator supplied. A black rectangular outline highlights an annotator's favourite model.
Figure 4: Weighted Likert bias matrix. Colour shows the weighted mean Likert score for each (annotator, model) cell, where weights equal the number of judgements that annotator provided for that model.
Original abstract

Hallucinations, the tendency for large language models to provide responses with factually incorrect and unsupported claims, is a serious problem within natural language processing for which we do not yet have an effective solution to mitigate against. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments, the first experiment to determine the prevalence of hallucinations for a prominent open source large language model (LLaMA-70B-Instruct) in medical QA given closed-source zero-shot prompts, and the second experiment to determine the prevalence of hallucinations and clinician preference to model responses. We observed, in experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7\% of answers (95\% CI 18.6 to 20.7) even though 98.8\% of prompt responses received maximal plausibility, and observed in experiment two, across models, lower hallucination rates aligned with higher usefulness scores ($\rho=-0.71$, $p=0.058$). Clinicians produced high agreement (quadratic weighted $\kappa=0.92$) and ($\tau_b=0.06$ to $0.18$, $\kappa=0.57$ to $0.61$) for experiments 1 and 2 respectively. Our findings indicate that, across all scales and architectures tested, current large language models remain unfit for unsupervised clinical deployment, and that human expert oversight is both necessary and the dominant cost driver.
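The abstract reports agreement as a quadratic weighted κ. As a rough illustration of what that statistic measures, here is a minimal pure-Python sketch of Cohen's kappa with quadratic weights for two raters on an ordinal scale. The function name and the example ratings are assumptions for illustration, not the authors' code or data.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, categories):
    """Cohen's kappa with quadratic disagreement weights for ordinal labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two ordered categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with the squared distance between labels.
            w = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += w * observed[(i, j)] / n
            weighted_expected += w * (marg_a[i] / n) * (marg_b[j] / n)
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical five-point ratings from two annotators, NOT the paper's data.
scale = [1, 2, 3, 4, 5]
annotator_1 = [5, 5, 4, 3, 5, 2]
annotator_2 = [5, 4, 4, 3, 5, 1]
print(round(quadratic_weighted_kappa(annotator_1, annotator_2, scale), 3))
```

Because the weights penalise distant disagreements more than adjacent ones, two raters who differ by only one scale point on a few items can still score close to 1.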

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports an empirical evaluation of hallucination rates in large language models for medical question-answering tasks grounded in textbook passages. For LLaMA-70B-Instruct, it finds a 19.7% hallucination rate (95% CI 18.6-20.7) despite 98.8% maximal plausibility ratings, and across models a negative correlation (ρ = -0.71, p=0.058) between hallucination rates and clinician usefulness scores. High inter-rater agreement is reported (κ=0.92), leading to the conclusion that LLMs require human oversight for clinical use.

Significance. If the assumption that the provided passages serve as exhaustive ground truth holds, the results provide a concrete quantification of hallucination prevalence in the medical domain and demonstrate alignment between lower hallucination and higher usefulness. The high clinician agreement strengthens the reliability of the measurements. This contributes to the literature on LLM safety in high-stakes applications by offering reproducible empirical benchmarks.

major comments (3)
  1. [Methods] Methods section on hallucination labeling: The procedure defines hallucinations as responses unsupported by the supplied textbook passages and treats these passages as complete evidence. No verification or justification is provided that the passages contain all necessary facts, so responses drawing on pretraining knowledge could be incorrectly labeled as hallucinations. This assumption is load-bearing for the central 19.7% rate and the usefulness correlation.
  2. [Experiment 1] Experiment 1 results and abstract: The 19.7% hallucination rate (95% CI 18.6-20.7) and 98.8% maximal plausibility are reported, but exact prompt construction details and passage selection criteria are insufficiently specified, limiting reproducibility and assessment of whether the passages are representative.
  3. [Results] Experiment 2 results: The reported correlation ρ=-0.71 (p=0.058) between hallucination rates and usefulness scores is based on a small number of models; the borderline significance and sensitivity to model choice require more discussion to support the alignment claim.
minor comments (2)
  1. [Title] Title contains the repeated phrase 'Language Language Models', which is a typographical error for 'Large Language Models'.
  2. [Abstract] Abstract states 'closed-source zero-shot prompts' without providing the exact wording or construction method used.
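Major comment 3 turns on how far a rank correlation over a handful of models can move when a single model is dropped. One way to probe that is a leave-one-out check, sketched below under stated assumptions: the helper names are illustrative and the eight data points are invented, since the per-model values are not available on this page.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


def leave_one_out(x, y):
    """Recompute rho once per item, each time omitting that item."""
    return [spearman_rho(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))]


# Invented values for eight hypothetical models, NOT the paper's results.
halluc_rate = [0.12, 0.15, 0.19, 0.22, 0.25, 0.28, 0.31, 0.35]
usefulness = [4.4, 4.1, 4.3, 3.6, 3.9, 3.2, 3.5, 3.0]

print("all models:", round(spearman_rho(halluc_rate, usefulness), 3))
for dropped, rho in enumerate(leave_one_out(halluc_rate, usefulness)):
    print(f"without model {dropped}: rho = {rho:.3f}")
```

A wide spread in the leave-one-out values would support the referee's concern that the alignment claim depends on which models were included.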

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We respond to each major comment below and indicate the revisions made to the manuscript.

point-by-point responses
  1. Referee: [Methods] Methods section on hallucination labeling: The procedure defines hallucinations as responses unsupported by the supplied textbook passages and treats these passages as complete evidence. No verification or justification is provided that the passages contain all necessary facts, so responses drawing on pretraining knowledge could be incorrectly labeled as hallucinations. This assumption is load-bearing for the central 19.7% rate and the usefulness correlation.

    Authors: We agree this is a key assumption in our grounded evaluation setup. The study specifically measures hallucinations relative to the provided textbook passages, as is standard in context-grounded QA to control for external knowledge. We will revise the Methods section to explicitly state and justify this assumption, and add a discussion in the Limitations section acknowledging that some responses may draw on pretraining knowledge not present in the passages. revision: yes

  2. Referee: [Experiment 1] Experiment 1 results and abstract: The 19.7% hallucination rate (95% CI 18.6-20.7) and 98.8% maximal plausibility are reported, but exact prompt construction details and passage selection criteria are insufficiently specified, limiting reproducibility and assessment of whether the passages are representative.

    Authors: We appreciate this feedback on reproducibility. In the revised version, we will provide the exact zero-shot prompt templates, the full details on how textbook passages were selected and excerpted (including source textbooks and criteria for relevance), and any additional information needed to replicate the dataset. revision: yes

  3. Referee: [Results] Experiment 2 results: The reported correlation ρ=-0.71 (p=0.058) between hallucination rates and usefulness scores is based on a small number of models; the borderline significance and sensitivity to model choice require more discussion to support the alignment claim.

    Authors: We concur that the number of models is small and the p-value is borderline. We will add discussion in the Results and Discussion sections addressing the sensitivity to model selection and the implications of the small sample, and we will temper the strength of the alignment claim accordingly while still reporting the observed correlation. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

full rationale

The paper reports direct experimental measurements of hallucination prevalence and clinician preference ratings on model outputs against supplied textbook passages. No derivation chain, equations, fitted parameters presented as predictions, or self-referential definitions exist. Prevalence (19.7%) and correlation (ρ=-0.71) are computed from raw counts and ratings without reduction to inputs by construction. Self-citations are absent from load-bearing claims; the methodology relies on external clinician judgments and provided evidence rather than any tautological loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Minimal ledger; the study relies on standard statistical assumptions for confidence intervals and rank correlations rather than new parameters or entities.

axioms (1)
  • [standard math] Standard assumptions underlying binomial confidence intervals and Spearman rank correlation
    Invoked for the reported 95% CI and ρ value.
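With very few data points, the significance of a rank correlation can be computed exactly by enumerating every possible ranking rather than relying on a large-sample approximation. The sketch below does this for untied data. The function name is hypothetical, the number of models behind the reported ρ is not stated on this page, and the candidate sizes in the usage loop are assumptions (Figure 1 ranks models from 1 to 8, which hints at no more than eight).

```python
from itertools import permutations


def exact_spearman_p(rho_obs: float, n: int) -> float:
    """Two-sided exact p-value for Spearman's rho with n untied observations."""
    if not 2 <= n <= 9:
        raise ValueError("full enumeration is only practical for 2 <= n <= 9")
    base = tuple(range(1, n + 1))
    scale = 6.0 / (n * (n * n - 1))
    threshold = abs(rho_obs) - 1e-12  # small tolerance for floating-point equality
    hits = 0
    total = 0
    for perm in permutations(base):
        # Without ties, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
        d2 = sum((a - b) ** 2 for a, b in zip(base, perm))
        rho = 1.0 - scale * d2
        if abs(rho) >= threshold:
            hits += 1
        total += 1
    return hits / total


# Candidate sample sizes are assumptions; rho = -0.71 is the rounded reported value.
for n_models in (6, 7, 8):
    print(n_models, round(exact_spearman_p(-0.71, n_models), 3))
```

Because attainable ρ values are discrete at these sample sizes, plugging in a rounded ρ can shift the result slightly, so this is a plausibility check rather than a reproduction of the reported p-value.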

pith-pipeline@v0.9.0 · 5597 in / 1101 out tokens · 76787 ms · 2026-05-16T02:23:27.897264+00:00 · methodology

