Med-HEAL: Analyzing and Mitigating Hallucinations in Medical LLMs with Hallucination-Aware In-Context Learning

Jose Eduardo Lizarraga Mazaba; Keke Chen; Yiming Liao; Zeno Franco

arxiv: 2606.01301 · v1 · pith:FNONAPQ7new · submitted 2026-05-31 · 💻 cs.CL

Pith reviewed 2026-06-28 17:15 UTC · model grok-4.3

keywords: medical LLMs, hallucinations, self-critique, in-context learning, EHR, clinical question answering, mitigation, MIMIC-IV

The pith

Self-critique improves accuracy on clinical questions for three of five medical LLMs without parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Med-HEAL, a framework that builds a labeled dataset of hallucinations from medical LLMs answering questions based on real electronic health records. It evaluates outputs using both an automated judge and human medical reviewers to mark errors. Two mitigation approaches are tested: a self-critique step where the model checks its own answer and a retrieval method that shows examples of past mistakes and fixes. Experiments on five open-source models find that self-critique raises accuracy in three cases with statistical significance. The work supplies both the dataset and the practical steps so that medical LLMs can be adjusted at inference time.

Core claim

Med-HEAL constructs a hallucination dataset from BioMistral-7B responses on the EHRNoteQA benchmark, which is drawn from MIMIC-IV discharge summaries, and labels the outputs via a dual GPT-4o and medical-student review pipeline. It then shows that a self-critique pipeline improves accuracy at p < 0.05 for three of the five tested models; the five are BioMistral, Llama-3.1, DeepSeek, Qwen2.5, and Qwen3. Retrieval-augmented in-context learning is also examined.

What carries the argument

The self-critique pipeline in which the test model reviews its own answers to detect potential errors and regenerates responses for flagged cases.
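The control flow of that pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt templates, the `FLAG`/`OK` verdict convention, and the `toy_model` stand-in are all assumptions made so the sketch runs without a real LLM.

```python
from typing import Callable

# Hypothetical prompt templates; the paper's actual prompts are not shown here.
ANSWER_PROMPT = "Discharge summary:\n{note}\n\nQuestion: {question}\nAnswer:"
CRITIQUE_PROMPT = (
    "Discharge summary:\n{note}\n\nQuestion: {question}\n"
    "Proposed answer: {answer}\n"
    "Does the answer contain an error? Reply FLAG or OK."
)
REGENERATE_PROMPT = (
    "Discharge summary:\n{note}\n\nQuestion: {question}\n"
    "A previous answer was flagged as possibly wrong: {answer}\n"
    "Write a corrected answer."
)


def self_critique(generate: Callable[[str], str], note: str, question: str) -> dict:
    """Answer, let the same model review its answer, regenerate only if flagged."""
    first = generate(ANSWER_PROMPT.format(note=note, question=question))
    verdict = generate(
        CRITIQUE_PROMPT.format(note=note, question=question, answer=first)
    )
    # Assumed convention: the critique reply starts with FLAG when an error is suspected.
    flagged = verdict.strip().upper().startswith("FLAG")
    final = first
    if flagged:
        final = generate(
            REGENERATE_PROMPT.format(note=note, question=question, answer=first)
        )
    return {"first": first, "flagged": flagged, "final": final}


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM call (invented behaviour, for demonstration only)."""
    if "Reply FLAG or OK" in prompt:
        proposed = prompt.split("Proposed answer:")[1]
        return "FLAG" if "aspirin" in proposed else "OK"
    if "flagged as possibly wrong" in prompt:
        return "warfarin"
    return "aspirin"
```

Passing the model as a plain callable keeps the loop independent of any particular inference library, and no parameters are updated at any step, which matches the inference-time framing of the paper.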

If this is right

Medical LLMs can achieve higher accuracy on clinical QA tasks through an inference-time self-critique step that requires no parameter changes.
The constructed hallucination dataset with correctness judgments and reasoning-error annotations becomes reusable for studying mitigation methods.
The same dual LLM-plus-human labeling process can be applied to annotate errors in other clinical reasoning tasks.
Retrieval-augmented in-context learning that supplies hallucinated and corrected examples offers a second tested mitigation route.
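The retrieval-augmented route in the last point can be illustrated with a small sketch. The abstract does not specify the retriever, so token-overlap (Jaccard) similarity is used here purely as a stand-in; the `Example` record, the prompt layout, and the `DEMO_POOL` entries are invented for illustration and are not taken from the Med-HEAL dataset.

```python
import re
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    hallucinated: str  # the model's earlier wrong answer
    corrected: str     # the reviewed, corrected answer


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; an assumed stand-in for the paper's retriever."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query: str, pool: list[Example], k: int = 2) -> list[Example]:
    """Return the k pool examples whose questions are most similar to the query."""
    return sorted(pool, key=lambda ex: jaccard(query, ex.question), reverse=True)[:k]


def build_prompt(query: str, pool: list[Example], k: int = 2) -> str:
    """Prepend retrieved mistake/fix pairs to the new question."""
    parts = [
        f"Question: {ex.question}\n"
        f"Incorrect answer: {ex.hallucinated}\n"
        f"Corrected answer: {ex.corrected}\n"
        for ex in retrieve(query, pool, k)
    ]
    parts.append(f"Question: {query}\nAnswer:")
    return "\n".join(parts)


# Invented toy examples, not real dataset entries.
DEMO_POOL = [
    Example("What dose of metoprolol was prescribed at discharge?", "50 mg", "25 mg"),
    Example("Which antibiotic was given for pneumonia?", "amoxicillin", "ceftriaxone"),
    Example("What was the discharge diagnosis?", "sepsis", "pneumonia"),
]
```

Calling `build_prompt("Which antibiotic was started for pneumonia?", DEMO_POOL, k=1)` places the pneumonia example ahead of the new question. A lexical or dense retriever could replace `jaccard` without changing the rest of the sketch.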

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be combined with human review in actual clinical workflows to catch remaining errors the model itself misses.
Similar self-critique might reduce hallucinations when models reason over other structured medical data such as lab results or imaging reports.
The framework provides a template for building domain-specific hallucination datasets in fields outside medicine where factual grounding matters.
Testing whether the accuracy gains persist when questions are drawn from newer or more diverse hospital records would strengthen the result.

Load-bearing premise

The dual evaluation pipeline that combines GPT-4o judgments with medical student auditing produces reliable correctness labels and hallucination annotations.

What would settle it

Applying the identical self-critique procedure to a sixth open-source medical LLM on the same EHRNoteQA-derived questions and finding no statistically significant accuracy gain would count against the generality of the reported benefit, although the paper itself reports gains for only three of the five models.

Figures

Figures reproduced from arXiv: 2606.01301 by Jose Eduardo Lizarraga Mazaba, Keke Chen, Yiming Liao, Zeno Franco.

**Figure 1.** Overall pipeline of Med-HEAL. Phase 1 (left): LLM-generated labels are audited against human ground truth provided …

Original abstract

Hallucinations in medical large language models (LLMs) pose serious risks for clinical decision support, particularly when models must reason over complex electronic health records (EHRs). However, existing benchmarks often lack a realistic clinical context and provide limited insight into how hallucinations can be mitigated in practice. We introduce Med-HEAL, a framework for systematically identifying, analyzing, and mitigating hallucinations in medical LLMs using clinically grounded data. Building on the EHRNoteQA benchmark derived from MIMIC-IV discharge summaries, we construct a hallucination dataset by evaluating BioMistral-7B on open-ended clinical question answering tasks. Model outputs are labeled through a dual evaluation pipeline that combines LLM-as-a-Judge assessment (GPT-4o) with human auditing by medical student reviewers, producing correctness judgments and annotations of reasoning errors via a custom web-based evaluation system. We then leverage this dataset to investigate mitigation strategies: a self-critique pipeline, in which the test model reviews its own answers to detect potential errors and regenerates responses for flagged cases, and retrieval-augmented in-context learning (RA-ICL), which exposes the model to hallucinated and corrected examples. Experiments across five open-source LLMs (BioMistral, Llama-3.1, DeepSeek, Qwen2.5, and Qwen3) show that the self-critique strategy improves accuracy for three of five models (p < 0.05) without requiring parameter updates. Med-HEAL provides both a reusable hallucination dataset and a practical framework for studying and mitigating hallucinations in medical LLMs, supporting safer deployment of AI systems in clinical environments. Our code and data are publicly available at https://github.com/yimingliao-blad/med-heal.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Med-HEAL ships a public hallucination dataset from EHRNoteQA and shows self-critique lifting accuracy on three of five models, but the whole result chain depends on unverified GPT-4o plus student labels.

The main thing to know is that the authors built a new set of labeled hallucination examples by running BioMistral on EHRNoteQA questions, then used GPT-4o plus medical-student review to mark correctness and error types. They test two mitigation ideas on five models and report that self-critique raises accuracy for three of them at p < 0.05, all without parameter updates. The code and data are released.

What works is the practical focus. They actually run the self-critique loop and the retrieval-augmented in-context examples on real open models, and they make the resource available so others can use it. That is more than just another benchmark paper.

The soft spot is the labeling step. The dual pipeline is described, but the abstract and stress-test note give no inter-rater numbers, no comparison to attending physicians, and no error analysis on the GPT-4o judgments. If those labels contain systematic misses on clinical reasoning, both the training examples and the reported accuracy gains become noisy. The paper also skips effect sizes and full baseline tables, so it is hard to judge how large or stable the gains are.

This is for groups that need ready-made medical hallucination data or want to try lightweight in-context fixes before fine-tuning. A reader who already works on LLM evaluation in healthcare will get concrete examples to inspect or extend.

It deserves peer review. The empirical setup is clear enough that referees can check the annotation process and ask for the missing controls.

Referee Report

1 major / 2 minor

Summary. The paper claims to introduce the Med-HEAL framework for identifying, analyzing, and mitigating hallucinations in medical LLMs. It constructs a hallucination dataset by running BioMistral-7B on the EHRNoteQA benchmark (derived from MIMIC-IV), labeling outputs via a dual pipeline of GPT-4o LLM-as-Judge plus medical-student auditing, then tests self-critique and RA-ICL mitigation strategies across five open-source LLMs, reporting statistically significant accuracy gains (p < 0.05) from self-critique on three of the five models without parameter updates. Code and data are released publicly.

Significance. If the dual-pipeline correctness labels can be validated, the work supplies a reusable clinical hallucination dataset and a practical, training-free mitigation technique that could improve safety of LLM-based clinical QA. The public release of code and data is a clear strength for reproducibility.

major comments (1)

[Abstract and Methods (dual evaluation pipeline)] The correctness judgments that determine both the hallucination dataset labels and the accuracy metrics rest on the dual GPT-4o + medical-student pipeline. No inter-rater agreement statistic, no validation against board-certified physicians, and no error analysis of GPT-4o medical judgments are reported. This is load-bearing for the central empirical claims.

minor comments (2)

[Results] Absolute accuracy deltas, effect sizes, and full per-model baseline tables are not provided alongside the p < 0.05 statements, making it hard to judge practical significance.
[Abstract] The five models are listed, but exact versions, parameter counts, and which three showed improvement are not cross-referenced to the experimental tables.
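To make the first minor comment concrete, the sketch below shows one way to report an accuracy delta alongside a p-value for paired before/after correctness labels. The source does not say which significance test the paper uses; an exact two-sided McNemar test is assumed here only because it suits paired binary outcomes, and the function name and return fields are hypothetical.

```python
from math import comb


def mcnemar_exact(before: list[bool], after: list[bool]) -> dict:
    """Exact two-sided McNemar test on paired correctness labels.

    before[i] / after[i]: whether question i was answered correctly
    without / with the mitigation step. The choice of test is an assumption.
    """
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(before)
    gained = sum((not b) and a for b, a in zip(before, after))  # wrong -> right
    lost = sum(b and (not a) for b, a in zip(before, after))    # right -> wrong
    discordant = gained + lost
    if discordant == 0:
        p_value = 1.0
    else:
        # Under H0 each discordant pair flips either way with probability 1/2.
        k = min(gained, lost)
        tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
        p_value = min(1.0, 2 * tail)
    return {
        "acc_before": sum(before) / n,
        "acc_after": sum(after) / n,
        "delta": (sum(after) - sum(before)) / n,
        "gained": gained,
        "lost": lost,
        "p_value": p_value,
    }
```

Reporting `delta`, `gained`, and `lost` next to `p_value` gives the reader the effect size and the number of answers the mitigation broke, neither of which a bare p < 0.05 statement conveys.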

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the evaluation pipeline, which is indeed central to our claims. We address the concerns point-by-point below and will revise the manuscript accordingly.

Referee: Abstract and Methods (dual evaluation pipeline): The correctness judgments that determine both the hallucination dataset labels and the accuracy metrics rest on the dual GPT-4o + medical-student pipeline. No inter-rater agreement statistic, no validation against board-certified physicians, and no error analysis of GPT-4o medical judgments are reported. This is load-bearing for the central empirical claims.

Authors: We agree the dual pipeline requires stronger validation reporting. In revision we will: (1) compute and report inter-rater agreement (Cohen's kappa) between GPT-4o labels and medical-student audits, plus agreement among student reviewers where multiple annotations exist; (2) add an error analysis of GPT-4o judgments on a random 10% subset, categorizing disagreement types; (3) explicitly state the absence of board-certified physician validation as a limitation due to resource constraints, while noting that medical students were supervised by clinical faculty. These changes will appear in Methods, Results, and a new Limitations section.

revision: partial
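For reference, the agreement statistic named in the rebuttal can be computed as below. This is a generic sketch of Cohen's kappa for two raters, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is observed agreement and $p_e$ is the agreement expected by chance; the label strings and the function name are illustrative assumptions, not part of the paper's code.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so perfect agreement is returned by convention (an assumption).
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa(["correct", "correct", "halluc", "halluc"], ["correct", "halluc", "halluc", "halluc"])` has observed agreement 0.75 and chance agreement 0.5, giving a kappa of 0.5.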

Circularity Check

0 steps flagged

No circularity; empirical results from external model runs on constructed dataset.

The paper's central claim is an empirical observation: self-critique improves accuracy (p<0.05) on three of five LLMs. This is obtained by running the models, applying the mitigation strategy, and measuring accuracy via the described dual pipeline. No equations, fitted parameters renamed as predictions, or self-citation chains reduce the result to its inputs by construction. The evaluation pipeline is an external measurement step, not a self-referential definition. The work is self-contained against its own benchmarks and external models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides limited visibility into modeling choices; the main unexamined premise is the reliability of the dual labeling pipeline.

axioms (1)

domain assumption: GPT-4o combined with medical-student review can accurately identify hallucinations and reasoning errors in clinical QA outputs.
This premise underpins the entire hallucination dataset construction described in the abstract.

pith-pipeline@v0.9.1-grok · 5877 in / 1424 out tokens · 32214 ms · 2026-06-28T17:15:59.179880+00:00 · methodology


[1] [1]

Asma Ben Abacha, Chaitanya Shivade, and Dina Demner-Fushman. 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entail- ment and question answering. InProceedings of the 18th bioNLP workshop and shared task. 370–379

2019

[2] [2]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Vibhor Agarwal, Yiqiao Jin, Mohit Chandra, Munmun De Choudhury, Srijan Kumar, and Nishanth Sastry. 2024. Medhalu: Hallucinations in responses to healthcare queries by large language models.arXiv preprint arXiv:2409.19492 (2024)

work page arXiv 2024

[4] [4]

Monica Agrawal, Stefan Hegselmann, Hunter Lang, Yoon Kim, and David Sontag

[5] [5]

In Proceedings of the 2022 conference on empirical methods in natural language processing

Large language models are few-shot clinical information extractors. In Proceedings of the 2022 conference on empirical methods in natural language processing. 1998–2022

2022

[6] [6]

Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou

[7] [7]

What learning algorithm is in-context learning? investigations with linear models.arXiv preprint arXiv:2211.15661(2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[8] [8]

Asma Ben Abacha and Dina Demner-Fushman. 2019. A question-entailment approach to question answering.BMC bioinformatics20, 1 (2019), 511

2019

[9] [9]

Stephanie Cabral, Daniel Restrepo, Zahir Kanjee, Philip Wilson, Byron Crowe, Raja-Elie Abdulnour, and Adam Rodman. 2024. Clinical reasoning of a generative artificial intelligence model compared with physicians.JAMA internal medicine 184, 5 (2024), 581–583

2024

[10] [10]

Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bon- net, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, and others. 2023. Meditron-70b: Scaling medical pre- training for large language models.arXiv preprint arXiv:2311.16079(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

Philip Chung, Akshay Swaminathan, Alex J Goodell, Yeasul Kim, S Mom- sen Reincke, Lichy Han, Ben Deverett, Mohammad Amin Sadeghi, Abdel-Badih Ariss, Marc Ghanem, and others. 2025. Verifying Facts in Patient Care Docu- ments Generated by Large Language Models Using Electronic Health Records. NEJM AI3, 1 (2025), AIdbp2500418

2025

[12] [12]

DeepSeek-AI. 2025. DeepSeek-R1-Distill-Llama-8B. https://huggingface.co/ deepseek-ai/DeepSeek-R1-Distill-Llama-8B

2025

[13] [13]

Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2024. Chain-of-verification reduces hallucina- tion in large language models. InFindings of the association for computational linguistics: ACL 2024. 3563–3578. Med-HEAL: Analyzing and Mitigating Hallucinations in Medical LLMs with Hallucination-Aw...

2024

[14] [14]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 Herd of Models.arXiv preprint arXiv:2407.21783(2024). https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Philip Feldman, James R Foulds, and Shimei Pan. 2023. Trapping llm hallucina- tions using tagged context prompts.arXiv preprint arXiv:2306.06085(2023)

work page arXiv 2023

[16] [16]

Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. 2022. What can transformers learn in-context? a case study of simple function classes.Ad- vances in neural information processing systems35 (2022), 30583–30598

2022

[17] [17]

Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. 2024. Does fine-tuning llms on new knowledge encourage hallucinations?. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 7765–7784

2024

[18]

Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. 2013. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211 (2013).

[20]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. 2024. A survey on LLM-as-a-judge. The Innovation (2024).

[21]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. 2025. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 8081 (2025), 633–638.

[22]

Tianyu Han, Lisa C Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexei Figueroa, Alexander Löser, Daniel Truhn, and Keno K Bressem. 2023. MedAlpaca – an open-source collection of medical conversational AI models and training data. arXiv preprint arXiv:2304.08247 (2023).

[23]

Yue Jiang, Jiawei Chen, Dingkang Yang, Mingcheng Li, Shunli Wang, Tong Wu, Ke Li, and Lihua Zhang. 2025. CoMT: Chain-of-medical-thought reduces hallucination in medical report generation. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.

[24]

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences 11, 14 (2021), 6421.

[25]

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 2567–2577.

[26]

Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Steven Horng, Leo Anthony Celi, and Roger Mark. 2020. MIMIC-IV. PhysioNet. Available online at: https://physionet.org/content/mimiciv/1.0/ (accessed August 23, 2021) (2020), 49–55.

[27]

Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific data 3, 1 (2016), 1–9.

[28]

Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, Rodrigo Gameiro, et al. 2025. Medical hallucinations in foundation models and their impact on healthcare. arXiv preprint arXiv:2503.05777 (2025).

[29]

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114, 13 (2017), 3521–3526.

[30]

Sunjun Kweon, Jiyoun Kim, Heeyoung Kwak, Dongchul Cha, Hangyul Yoon, Kwang Kim, Jeewon Yang, Seunghyun Won, and Edward Choi. 2024. EHRNoteQA: An LLM benchmark for real-world clinical practice using discharge summaries. Advances in Neural Information Processing Systems 37 (2024), 124575–124611.

[31]

Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-antoine Gourraud, Mickaël Rouvier, and Richard Dufour. 2024. BioMistral: a collection of open-source pretrained large language models for medical domains. In 62nd annual meeting of the association for computational linguistics (ACL’24).

[32]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in neural information processing systems 33 (2020), 9459–9474.

[33]

Yuchong Li, Xiaojun Zeng, Chihua Fang, Jian Yang, Fucang Jia, and Lei Zhang. 2025. ClinBench-HPB: a clinical benchmark for evaluating LLMs in hepato-pancreato-biliary diseases. arXiv preprint arXiv:2506.00095 (2025).

[35]

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 conference on empirical methods in natural language processing. 2511–2522.

[36]

Man Luo, Xin Xu, Zhuyun Dai, Panupong Pasupat, Mehran Kazemi, Chitta Baral, Vaiva Imbrasaite, and Vincent Y Zhao. 2024. Dr. ICL: Demonstration-retrieved in-context learning. Data Intelligence 6, 4 (2024), 909–922.

[37]

Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. 2022. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics 23, 6 (2022).

[38]

Mingyu Derek Ma, Chenchen Ye, Yu Yan, Xiaoxuan Wang, Peipei Ping, Timothy S Chang, and Wei Wang. 2024. CliBench: A multifaceted and multigranular evaluation of large language models for clinical decision making. arXiv preprint arXiv:2406.09923 (2024).

[39]

Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, et al. 2022. Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 9844–9855.

[40]

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2023. Med-HALT: Medical domain hallucination test for large language models. arXiv preprint arXiv:2307.15343 (2023).

[41]

Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, and Ying Ding. 2025. MedHallu: A comprehensive benchmark for detecting medical hallucinations in large language models. arXiv preprint arXiv:2502.14302 (2025).

[42]

Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Vol. 4. Now Publishers Inc.

[43]

Walter J Rogan and Beth Gladen. 1978. Estimating prevalence from the results of a screening test. American journal of epidemiology 107, 1 (1978), 71–76.

[44]

Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 conference of the North American chapter of the association for computational linguistics: human language technologies. 2655–2671.

[45]

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2023. Large language models encode clinical knowledge. Nature 620, 7972 (2023), 172–180.

[46]

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. 2025. Toward expert-level medical question answering with large language models. Nature medicine 31, 3 (2025), 943–950.

[47]

Jun Sun, Yiteng Pan, and Xiaohu Yan. 2025. Improving intermediate reasoning in zero-shot chain-of-thought for large language models with filter supervisor-self correction. Neurocomputing 620 (2025), 129219.

[48]

Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. 2023. Huatuo: Tuning LLaMA model with Chinese medical knowledge. arXiv preprint arXiv:2304.06975 (2023).

[49]

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in neural information processing systems 33 (2020), 5776–5788.

[51]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.

[52]

Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Weidi Xie, and Yanfeng Wang. 2024. PMC-LLaMA: toward building open-source language models for medicine. Journal of the American Medical Informatics Association 31, 9 (2024), 1833–1843.

[53]

Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J Tao, Min Woo Sun, Alejandro Lozano, and James Zou. 2025. MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports. arXiv preprint arXiv:2505.11733 (2025).

[54]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025).

[55]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi T...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[56]

Cyril Zakka, Rohan Shad, Akash Chaurasia, Alex R Dalal, Jennifer L Kim, Michael Moor, Robyn Fong, Curran Phillips, Kevin Alexander, Euan Ashley, et al. 2024. Almanac – retrieval-augmented language models for clinical medicine. NEJM AI 1, 2 (2024), AIoa2300068.

[58]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and chatbot arena. Advances in neural information processing systems 36 (2023), 46595–46623.

A Supplementary Materials

The following supplementary materials provide ex...

[59]

and QwQ-32B [49] have shown promising performance in complex diagnostic reasoning tasks. These models are particularly attractive to healthcare institutions because they can run locally while maintaining competitive performance.

B.2 Study of Medical LLM Hallucinations

A growing body of work has developed benchmarks to evaluate large language models (LLMs...

[60]

similarly evaluate patient-specific reasoning and fact verification using real hospital records. Beyond EHRs, datasets such as MedCaseReasoning [48] leverage thousands of clinical case reports from PubMed Central to evaluate diagnostic reasoning processes, while Med-HALT [36] uses standardized medical examinations and PubMed abstracts to test hallucinat...