Quantifying Hallucinations in Language Language Models on Medical Textbooks
Pith reviewed 2026-05-16 02:23 UTC · model grok-4.3
The pith
LLaMA-70B-Instruct hallucinates in 19.7 percent of textbook-grounded medical answers, even though 98.8 percent of its responses receive the maximal plausibility rating.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using textbook passages as the fixed evidence source, LLaMA-70B-Instruct produced hallucinations in 19.7 percent of answers to medical QA prompts (95% CI 18.6 to 20.7), even though 98.8 percent of responses received the maximal plausibility rating. Across models, lower hallucination rates tracked higher clinician usefulness scores (ρ = -0.71, p = 0.058). The authors report high rater agreement in experiment 1 (quadratic weighted κ = 0.92) and κ = 0.57 to 0.61 in experiment 2. The work concludes that large language models remain unfit for unsupervised clinical deployment.
What carries the argument
Evaluation that flags hallucinations by comparing model answers against the exact textbook passages provided in the prompt, combined with clinician ratings of usefulness and plausibility.
If this is right
- Models with fewer hallucinations receive higher usefulness ratings from clinicians.
- High plausibility scores do not prevent factual errors in medical answers.
- Human oversight is required and constitutes the main cost of safe deployment.
- Current models across sizes and architectures are unsuitable for unsupervised clinical use.
Where Pith is reading between the lines
- Fixed-evidence hallucination benchmarks could transfer to other domains that rely on authoritative source documents.
- Lowering hallucination rates would directly reduce the human review burden in medical applications.
- Performance on textbook-grounded tasks may not predict behavior when source material is incomplete or conflicting.
Load-bearing premise
The supplied textbook passages are assumed to contain complete information sufficient to judge whether any claim in a model response is hallucinated, and clinician ratings are assumed to reflect factual correctness accurately.
What would settle it
A follow-up study in which clinicians receive the full original textbook chapters and must mark every unsupported claim in the model responses.
Original abstract
Hallucinations, the tendency for large language models to provide responses with factually incorrect and unsupported claims, is a serious problem within natural language processing for which we do not yet have an effective solution to mitigate against. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments, the first experiment to determine the prevalence of hallucinations for a prominent open source large language model (LLaMA-70B-Instruct) in medical QA given closed-source zero-shot prompts, and the second experiment to determine the prevalence of hallucinations and clinician preference to model responses. We observed, in experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7\% of answers (95\% CI 18.6 to 20.7) even though 98.8\% of prompt responses received maximal plausibility, and observed in experiment two, across models, lower hallucination rates aligned with higher usefulness scores ($\rho=-0.71$, $p=0.058$). Clinicians produced high agreement (quadratic weighted $\kappa=0.92$) and ($\tau_b=0.06$ to $0.18$, $\kappa=0.57$ to $0.61$) for experiments 1 and 2 respectively. Our findings indicate that, across all scales and architectures tested, current large language models remain unfit for unsupervised clinical deployment, and that human expert oversight is both necessary and the dominant cost driver.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical evaluation of hallucination rates in large language models for medical question-answering tasks grounded in textbook passages. For LLaMA-70B-Instruct, it finds a 19.7% hallucination rate (95% CI 18.6-20.7) despite 98.8% maximal plausibility ratings, and across models a negative correlation (ρ = -0.71, p=0.058) between hallucination rates and clinician usefulness scores. High inter-rater agreement is reported (κ=0.92), leading to the conclusion that LLMs require human oversight for clinical use.
Significance. If the assumption that the provided passages serve as exhaustive ground truth holds, the results provide a concrete quantification of hallucination prevalence in the medical domain and demonstrate alignment between lower hallucination and higher usefulness. The high clinician agreement strengthens the reliability of the measurements. This contributes to the literature on LLM safety in high-stakes applications by offering reproducible empirical benchmarks.
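The agreement statistic cited above can be made concrete with a small sketch. The following Python function computes a quadratic weighted kappa for two raters on an ordinal scale. The function name, the 0-based category coding, and the sample ratings are illustrative assumptions, not the authors' code or data.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Quadratic weighted kappa for two raters.

    Ratings are integers in range(num_categories). Disagreements are
    penalised by the squared distance between the two categories.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = num_categories
    n = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal counts, used to build the chance-expected matrix.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ratings on a 3-point scale (0, 1, 2), for illustration only.
clinician_1 = [2, 2, 1, 0, 2, 1, 2, 0]
clinician_2 = [2, 2, 1, 0, 1, 1, 2, 0]
print(quadratic_weighted_kappa(clinician_1, clinician_2, num_categories=3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the quadratic weighting rewards near-misses on an ordinal scale more than an unweighted kappa would.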
major comments (3)
- [Methods] Methods section on hallucination labeling: The procedure defines hallucinations as responses unsupported by the supplied textbook passages and treats these passages as complete evidence. No verification or justification is provided that the passages contain all necessary facts, so responses drawing on pretraining knowledge could be incorrectly labeled as hallucinations. This assumption is load-bearing for the central 19.7% rate and the usefulness correlation.
- [Experiment 1] Experiment 1 results and abstract: The 19.7% hallucination rate (95% CI 18.6-20.7) and 98.8% maximal plausibility are reported, but exact prompt construction details and passage selection criteria are insufficiently specified, limiting reproducibility and assessment of whether the passages are representative.
- [Results] Experiment 2 results: The reported correlation ρ=-0.71 (p=0.058) between hallucination rates and usefulness scores is based on a small number of models; the borderline significance and sensitivity to model choice require more discussion to support the alignment claim.
minor comments (2)
- [Title] Title contains the repeated phrase 'Language Language Models', which is a typographical error for 'Large Language Models'.
- [Abstract] Abstract states 'closed-source zero-shot prompts' without providing the exact wording or construction method used.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We respond to each major comment below and indicate the revisions made to the manuscript.
Referee: [Methods] Methods section on hallucination labeling: The procedure defines hallucinations as responses unsupported by the supplied textbook passages and treats these passages as complete evidence. No verification or justification is provided that the passages contain all necessary facts, so responses drawing on pretraining knowledge could be incorrectly labeled as hallucinations. This assumption is load-bearing for the central 19.7% rate and the usefulness correlation.
Authors: We agree this is a key assumption in our grounded evaluation setup. The study specifically measures hallucinations relative to the provided textbook passages, as is standard in context-grounded QA to control for external knowledge. We will revise the Methods section to explicitly state and justify this assumption, and add a discussion in the Limitations section acknowledging that some responses may draw on pretraining knowledge not present in the passages.
revision: yes
Referee: [Experiment 1] Experiment 1 results and abstract: The 19.7% hallucination rate (95% CI 18.6-20.7) and 98.8% maximal plausibility are reported, but exact prompt construction details and passage selection criteria are insufficiently specified, limiting reproducibility and assessment of whether the passages are representative.
Authors: We appreciate this feedback on reproducibility. In the revised version, we will provide the exact zero-shot prompt templates, full details on how textbook passages were selected and excerpted (including source textbooks and criteria for relevance), and any additional information needed to replicate the dataset.
revision: yes
Referee: [Results] Experiment 2 results: The reported correlation ρ=-0.71 (p=0.058) between hallucination rates and usefulness scores is based on a small number of models; the borderline significance and sensitivity to model choice require more discussion to support the alignment claim.
Authors: We concur that the number of models is small and the p-value is borderline. We will expand the Results and Discussion sections to address sensitivity to model selection and the implications of the small sample, and we will temper the strength of the alignment claim while still reporting the observed correlation.
revision: partial
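The sensitivity concern raised in this exchange can be illustrated with a short sketch. The Python below computes Spearman's rank correlation from scratch and then recomputes it with each model left out in turn. The function names and the per-model values are hypothetical; the paper's per-model numbers are not given in this review.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def leave_one_out_rho(x, y):
    """Recompute rho after dropping each observation in turn."""
    return [
        spearman_rho(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        for i in range(len(x))
    ]


# Hypothetical per-model values, for illustration only.
hallucination_rate = [0.31, 0.27, 0.22, 0.20, 0.18, 0.15, 0.12]
usefulness_score = [2.9, 2.4, 3.3, 3.1, 3.0, 3.8, 3.6]

print("rho:", spearman_rho(hallucination_rate, usefulness_score))
print("leave-one-out:", leave_one_out_rho(hallucination_rate, usefulness_score))
```

With only a handful of models, the spread of the leave-one-out values gives a quick sense of how much a single model drives the reported correlation.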
Circularity Check
No circularity: purely empirical measurement study
The paper reports direct experimental measurements of hallucination prevalence and clinician preference ratings on model outputs against supplied textbook passages. No derivation chain, equations, fitted parameters presented as predictions, or self-referential definitions exist. Prevalence (19.7%) and correlation (ρ=-0.71) are computed from raw counts and ratings without reduction to inputs by construction. Self-citations are absent from load-bearing claims; the methodology relies on external clinician judgments and provided evidence rather than any tautological loop.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard assumptions underlying binomial confidence intervals and Spearman rank correlation.
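As a rough illustration of the binomial interval named in this ledger, the Python sketch below computes a 95% confidence interval for a proportion in two common ways. The review does not say which interval method the authors used or how many answers were rated, so the Wald and Wilson formulas and the counts shown here are assumptions for illustration.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a binomial proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval, better behaved for small n or extreme p."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: 197 flagged answers out of 1000 rated answers.
flagged, rated = 197, 1000
print("Wald:  ", wald_interval(flagged, rated))
print("Wilson:", wilson_interval(flagged, rated))
```

Both intervals assume each rated answer is an independent trial with the same hallucination probability, which is the "standard math" assumption the ledger refers to.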