
arxiv: 2605.20591 · v1 · pith:VBEPQLJInew · submitted 2026-05-20 · 💻 cs.CL · cs.CY

Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

Pith reviewed 2026-05-21 05:46 UTC · model grok-4.3

classification 💻 cs.CL cs.CY
keywords medical LLMs · hallucination detection · policy compliance · web deployment · safety evaluation · factual accuracy · MedGPT · AI risks

The pith

A large-scale assessment finds that 25-30 percent of sampled custom medical GPTs exhibit low factual accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys 6,233 custom medical GPTs (MedGPTs) hosted on public web platforms to measure how often they generate false medical information or break safety rules. It applies two new evaluation frameworks to a stratified sample of 1,500 MedGPTs and ten open-source models. The results indicate that bottom- and middle-tier models are most likely to give wrong answers, that between 33.6 and 54.3 percent breach operational thresholds, and that most action-enabled models omit adequate privacy disclosures. Open-source models produce more stable output but score lower on factual accuracy than the MedGPTs. The authors release a structured dataset, HAA-MedGPT, so that others can repeat and extend the checks.

Core claim

Using the MedGPT-HEval framework to score factual accuracy and an LLM pipeline to flag policy violations, the study finds that 25-30 percent of the sampled MedGPTs display low factual accuracy, with bottom- and middle-tier models at highest risk, while 33.6-54.3 percent breach operational thresholds and 57.06 percent of action-enabled models lack adequate privacy disclosures. MedGPTs score higher on factual accuracy and semantic alignment than the open-source models, yet the open-source models prove more stable overall.

What carries the argument

MedGPT-HEval framework for hallucination detection together with an LLM-based pipeline that scores policy violations and developer intent.

If this is right

  • Platforms hosting medical models would need routine multi-metric checks before allowing public access.
  • Developers of lower-tier medical GPTs would face pressure to improve accuracy or add clear disclaimers.
  • Action-enabled models would require mandatory privacy disclosures to meet basic safety standards.
  • Open-source medical models would need separate stability testing even when their accuracy trails custom versions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Clinicians using these tools for quick lookups might still need to cross-check outputs against established medical references.
  • Public web platforms could introduce warning labels or usage limits for models that fail the accuracy thresholds.
  • Extending the same checks to live patient dialogues rather than static prompts could expose additional real-world failure modes.

Load-bearing premise

The two evaluation frameworks correctly flag hallucinations and policy breaches in the sampled models without large numbers of false positives or selection bias.
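To make the stakes of this premise concrete, the sketch below shows how a detector's error rates would distort a measured rate. It uses the standard relation between observed and true prevalence for a binary classifier, and its inversion (often called the Rogan-Gladen estimator). The sensitivity and specificity values are hypothetical, since the paper does not report them, and the function names are illustrative only.

```python
def observed_rate(true_rate: float, sensitivity: float, specificity: float) -> float:
    """Share of items a detector flags, given the true share of positives."""
    return sensitivity * true_rate + (1.0 - specificity) * (1.0 - true_rate)


def corrected_rate(observed: float, sensitivity: float, specificity: float) -> float:
    """Invert observed_rate to estimate the true share, clipped to [0, 1]."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("detector must perform better than chance")
    estimate = (observed + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # Hypothetical detector quality; these numbers are NOT from the paper.
    sens, spec = 0.90, 0.85
    for measured in (0.25, 0.30):
        print(measured, round(corrected_rate(measured, sens, spec), 3))
```

Under these assumed error rates the corrected share falls below the measured one, which is why the premise about false positives carries so much weight.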

What would settle it

An independent team manually reviewing the same 1,500 models and reporting substantially lower rates of factual errors or policy violations.

Figures

Figures reproduced from arXiv: 2605.20591 by Muhammad Ikram, Rahat Masood, Sunday Oyinlola Ogundoyin.

Figure 2. Top 20 authors of MedGPTs.

TABLE III: Capabilities of MedGPTs.

  # Capabilities   # GPTs
  0                1,538 (24.68%)
  1                353 (5.66%)
  2                2,028 (32.54%)
  3                1,833 (29.41%)
  4                408 (6.55%)
  5                68 (1.09%)
  6                5 (0.08%)
  Total            6,233

  Capability                          # GPTs
  Web Search                          4,435 (71.15%)
  DALL.E Images                       3,343 (53.63%)
  Code Interpreter & Data Analysis    1,880 (30.16%)
  Canvas                              1,201 (19.27%)
  4o Image Generation                 881 (14.13%)
  Actions                             170 (2.73%)

Figure 1. ECDF of number of MedGPTs per author/developer.

Figure 3. Cumulative distribution of MedGPTs hallucination scores.

Figure 4. Correlation between hallucination scores and conversation counts of Top 1000 MedGPTs.

Figure 5. Correlation between conversation counts and reviews

Figure 6. Cumulative distribution of misuse in MedGPTs.

Figure 8. Distribution of authors and conversation counts for …

Figure 9. Distribution of privacy policy alignment scores.
Original abstract

Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates risks in 6,233 web-deployed medical GPTs (MedGPTs) plus 10 open-source LLMs by analyzing a stratified sample of 1,500 MedGPTs. It introduces the MedGPT-HEval framework for hallucination detection and an LLM-based pipeline for policy violations and developer intent. Reported results include 25-30% of MedGPTs with low factual accuracy (highest in bottom- and middle-tier models), 33.6-54.3% violating operational thresholds, and 57.06% of Action-enabled models lacking adequate privacy disclosures. MedGPTs show higher factual accuracy and semantic alignment than open-source models but lower stability. The authors release the HAA-MedGPT dataset.

Significance. If the evaluation frameworks prove reliable, the work supplies large-scale empirical data on hallucination and compliance failures in real-world medical LLMs, underscoring the need for multi-metric safeguards. Strengths include the stratified sampling across tiers, release of a structured dataset for future research, and direct comparison to open-source baselines.

major comments (2)
  1. [Methods (MedGPT-HEval framework)] Methods section describing MedGPT-HEval: the framework for hallucination detection is presented without reported human validation, inter-rater agreement metrics, expert-annotated gold standard, or ablation on prompt sensitivity. Because the headline 25-30% low factual accuracy claim and the tier-based risk comparisons rest entirely on the correctness of these automated labels, the absence of validation data is load-bearing for the central empirical results.
  2. [Methods and Results (policy violation pipeline)] Methods and Results sections on the LLM-based policy pipeline: no external benchmarks, human review, or false-positive analysis are provided for the operational thresholds or privacy-disclosure judgments. This directly affects the reliability of the 33.6-54.3% violation rates and the 57.06% privacy-gap statistic, especially given the possibility that the judge LLM may over-flag hedging language or miss subtle breaches.
minor comments (2)
  1. [Abstract] The abstract states both '25-30%' and specific ranges such as '33.6-54.3%' without clarifying whether the broader interval reflects different tiers or post-hoc adjustments; a single consistent reporting format would improve clarity.
  2. [Figures and Tables] Figure captions and table legends could more explicitly define the stratification criteria (e.g., what constitutes 'bottom-tier' vs. 'middle-tier' MedGPTs) to allow readers to assess selection bias.
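Major comment 1 asks for inter-rater agreement on human labels. One common statistic for two raters is Cohen's kappa, which discounts the agreement expected by chance. The sketch below is a minimal plain-Python version; the label values and the function name are illustrative assumptions, and nothing here reflects how the authors would run their validation.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


if __name__ == "__main__":
    # Hypothetical labels from two experts on four responses.
    expert_1 = ["hallucinated", "hallucinated", "accurate", "accurate"]
    expert_2 = ["hallucinated", "accurate", "accurate", "accurate"]
    print(cohen_kappa(expert_1, expert_2))
```

Reporting kappa alongside raw agreement matters when one label dominates, because raw agreement can look high even if the raters rarely agree on the minority class.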

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate additional validation for the automated frameworks, thereby strengthening the reliability of the reported empirical results.

Point-by-point responses
  1. Referee: [Methods (MedGPT-HEval framework)] Methods section describing MedGPT-HEval: the framework for hallucination detection is presented without reported human validation, inter-rater agreement metrics, expert-annotated gold standard, or ablation on prompt sensitivity. Because the headline 25-30% low factual accuracy claim and the tier-based risk comparisons rest entirely on the correctness of these automated labels, the absence of validation data is load-bearing for the central empirical results.

    Authors: We agree that explicit validation of MedGPT-HEval is essential given its role in the factual accuracy and tier-comparison results. The current manuscript describes the automated pipeline and its application to the stratified sample but does not report human validation, inter-rater metrics, or prompt ablations. In the revised manuscript we will add a new Methods subsection presenting human validation on a stratified subset of 300 responses, including inter-rater agreement from two domain experts, a released expert-annotated gold-standard sample, and prompt-sensitivity results. These additions directly address the load-bearing concern for the 25-30% low-accuracy finding. revision: yes

  2. Referee: [Methods and Results (policy violation pipeline)] Methods and Results sections on the LLM-based policy pipeline: no external benchmarks, human review, or false-positive analysis are provided for the operational thresholds or privacy-disclosure judgments. This directly affects the reliability of the 33.6-54.3% violation rates and the 57.06% privacy-gap statistic, especially given the possibility that the judge LLM may over-flag hedging language or miss subtle breaches.

    Authors: We concur that the absence of human review and false-positive analysis limits confidence in the policy-violation and privacy statistics. The manuscript currently presents the LLM judge pipeline and aggregate rates without these checks. We will revise both the Methods and Results sections to include a human audit of 200 randomly sampled judgments, reporting precision, false-positive rates (with explicit discussion of hedging language), and any missed subtle breaches. This will support the reported 33.6-54.3% violation and 57.06% privacy-gap figures. revision: yes
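The audit promised in the second response comes down to comparing each judge decision with a human decision on the same item. A minimal sketch of that tally follows, assuming every audited judgment is recorded as a pair (judge flagged, human confirmed a violation). The record layout and the names `AuditResult` and `audit_judge` are hypothetical, not part of the paper's pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class AuditResult:
    precision: float            # flagged items that humans confirmed
    false_positive_rate: float  # compliant items the judge flagged anyway
    recall: float               # real violations the judge caught


def _ratio(numerator: int, denominator: int) -> float:
    """Return NaN instead of raising when a category is empty."""
    return numerator / denominator if denominator else float("nan")


def audit_judge(pairs: Iterable[Tuple[bool, bool]]) -> AuditResult:
    """Tally (judge_flagged, human_confirmed) pairs into audit metrics."""
    tp = fp = fn = tn = 0
    for judge_flagged, human_confirmed in pairs:
        if judge_flagged and human_confirmed:
            tp += 1
        elif judge_flagged:
            fp += 1
        elif human_confirmed:
            fn += 1
        else:
            tn += 1
    return AuditResult(
        precision=_ratio(tp, tp + fp),
        false_positive_rate=_ratio(fp, fp + tn),
        recall=_ratio(tp, tp + fn),
    )
```

Note that precision can be estimated from flagged items alone, whereas the false-positive rate and any count of missed breaches require humans to review unflagged items as well, which a random sample of judgments would supply.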

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper's central results consist of empirical measurements obtained by applying the newly introduced MedGPT-HEval framework and LLM-based policy pipeline to a stratified sample of 1,500 MedGPTs plus 10 open-source models. No equations, fitted parameters, or self-citations are present that reduce the reported percentages (25-30% low factual accuracy, 33.6-54.3% operational violations, 57.06% privacy gaps) to self-defined quantities or inputs by construction. The evaluation chain relies on direct comparison of model outputs against external factual and policy criteria rather than any loop that re-derives the same quantities from the measurement process itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical measurement study; primary assumptions concern the validity of the new detection frameworks and the representativeness of the 1,500-model stratified sample drawn from 6,233 total.

axioms (1)
  • domain assumption: The stratified sample of 1,500 models is representative of the full population of 6,233 MedGPTs for hallucination and compliance properties.
    Central to generalizing the 25-30% and 33.6-54.3% figures; stratification details not visible in abstract.
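For a rough sense of the sampling uncertainty alone, the sketch below computes an approximate 95% margin of error for a proportion measured on 1,500 of 6,233 items, with a finite population correction. It assumes simple random sampling, which is a simplification: the paper uses stratified sampling and its stratum sizes are not visible here. It also says nothing about detector error. The function name is illustrative.

```python
import math


def proportion_margin(p: float, n: int, population: int, z: float = 1.96) -> float:
    """Approximate margin of error for a sample proportion.

    Assumes simple random sampling without replacement and applies the
    finite population correction sqrt((N - n) / (N - 1)).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion in [0, 1]")
    if not 0 < n <= population:
        raise ValueError("need 0 < n <= population")
    standard_error = math.sqrt(p * (1.0 - p) / n)
    fpc = math.sqrt((population - n) / (population - 1)) if population > 1 else 0.0
    return z * standard_error * fpc


if __name__ == "__main__":
    for share in (0.25, 0.30):
        margin = proportion_margin(share, n=1500, population=6233)
        print(f"{share:.2f} +/- {margin:.3f}")
```

Under these assumptions the sampling margin is on the order of two percentage points, so the representativeness of the strata and the accuracy of the automated labels are likely the larger sources of uncertainty.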

pith-pipeline@v0.9.0 · 5758 in / 1250 out tokens · 37790 ms · 2026-05-21T05:46:56.608489+00:00 · methodology

