Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

Muhammad Ikram; Rahat Masood; Sunday Oyinlola Ogundoyin

arxiv: 2605.20591 · v1 · pith:VBEPQLJInew · submitted 2026-05-20 · 💻 cs.CL · cs.CY


Pith reviewed 2026-05-21 05:46 UTC · model grok-4.3


keywords: medical LLMs, hallucination detection, policy compliance, web deployment, safety evaluation, factual accuracy, MedGPT, AI risks


The pith

A large-scale review finds that 25-30 percent of web-deployed custom medical GPTs show low factual accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines thousands of medical large language models hosted on public web platforms to measure how often they generate false medical information or break safety rules. It applies two new evaluation frameworks to a stratified sample of 1,500 custom medical GPTs and to ten open-source models. The results indicate that bottom- and middle-tier models are most likely to give wrong answers, that roughly a third to a half breach operational thresholds, and that most Action-enabled models lack adequate privacy disclosures. Open-source models produce more stable output but score lower on factual accuracy than the custom ones. The authors release a structured dataset so that others can repeat and extend the checks.

Core claim

Using the MedGPT-HEval framework to score factual accuracy and an LLM pipeline to flag policy violations, the study finds that 25-30 percent of the sampled MedGPTs display low factual accuracy, with bottom- and middle-tier models at highest risk, while 33.6-54.3 percent breach operational thresholds and 57.06 percent of action-enabled models lack adequate privacy disclosures. MedGPTs score higher on factual accuracy and semantic alignment than the open-source models, yet the open-source models prove more stable overall.

What carries the argument

MedGPT-HEval framework for hallucination detection together with an LLM-based pipeline that scores policy violations and developer intent.

If this is right

Platforms hosting medical models would need routine multi-metric checks before allowing public access.
Developers of lower-tier medical GPTs would face pressure to improve accuracy or add clear disclaimers.
Action-enabled models would require mandatory privacy disclosures to meet basic safety standards.
Open-source medical models would need separate stability testing even when their accuracy trails custom versions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Clinicians using these tools for quick lookups might still need to cross-check outputs against established medical references.
Public web platforms could introduce warning labels or usage limits for models that fail the accuracy thresholds.
Extending the same checks to live patient dialogues rather than static prompts could expose additional real-world failure modes.

Load-bearing premise

The two evaluation frameworks correctly flag hallucinations and policy breaches in the sampled models without large numbers of false positives or selection bias.

What would settle it

An independent team manually reviewing the same 1,500 models and reporting substantially lower rates of factual errors or policy violations.

Figures

Figures reproduced from arXiv: 2605.20591 by Muhammad Ikram, Rahat Masood, Sunday Oyinlola Ogundoyin.

**Figure 2.** Top 20 authors of MedGPTs.

TABLE III: Capabilities of MedGPTs.

| # Capabilities | # GPTs |
|---|---|
| 0 | 1,538 (24.68%) |
| 1 | 353 (5.66%) |
| 2 | 2,028 (32.54%) |
| 3 | 1,833 (29.41%) |
| 4 | 408 (6.55%) |
| 5 | 68 (1.09%) |
| 6 | 5 (0.08%) |
| Total | 6,233 |

| Capability | # GPTs |
|---|---|
| Web Search | 4,435 (71.15%) |
| DALL.E Images | 3,343 (53.63%) |
| Code Interpreter & Data Analysis | 1,880 (30.16%) |
| Canvas | 1,201 (19.27%) |
| 4o Image Generation | 881 (14.13%) |
| Actions | 170 (2.73%) |

**Figure 1.** ECDF of number of MedGPTs per author/developer.

**Figure 3.** Cumulative distribution of MedGPTs hallucination scores.

**Figure 4.** Correlation between hallucination scores and conversation counts of Top 1000 MedGPTs.

**Figure 5.** Correlation between conversation counts and reviews

**Figure 6.** Cumulative distribution of misuse in MedGPTs.

**Figure 8.** Distribution of authors and conversation counts for

**Figure 9.** Distribution of privacy policy alignment scores.
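
The Table III counts quoted in the Figure 2 entry can be checked for internal consistency. The sketch below recomputes each percentage against the stated total of 6,233. The variable and function names are illustrative. The last check, which finds that 57.06% corresponds to 97 of the 170 Action-enabled GPTs, is an inference from rounding and not a count reported on this page.

```python
TOTAL = 6233  # total MedGPTs, as stated in Table III

# (count, stated percentage), transcribed from Table III
by_num_capabilities = {
    0: (1538, 24.68), 1: (353, 5.66), 2: (2028, 32.54), 3: (1833, 29.41),
    4: (408, 6.55), 5: (68, 1.09), 6: (5, 0.08),
}
by_capability = {
    "Web Search": (4435, 71.15),
    "DALL.E Images": (3343, 53.63),
    "Code Interpreter & Data Analysis": (1880, 30.16),
    "Canvas": (1201, 19.27),
    "4o Image Generation": (881, 14.13),
    "Actions": (170, 2.73),
}


def mismatches(table, total=TOTAL):
    """Return the keys whose stated percentage differs from count / total."""
    return [
        key for key, (count, stated) in table.items()
        if abs(round(100 * count / total, 2) - stated) > 1e-9
    ]


# The per-count rows should partition the full population.
rows_sum = sum(count for count, _ in by_num_capabilities.values())

# Inference only: which integer counts out of 170 round to 57.06%?
action_enabled = by_capability["Actions"][0]
privacy_gap_candidates = [
    k for k in range(action_enabled + 1)
    if abs(round(100 * k / action_enabled, 2) - 57.06) < 1e-9
]

if __name__ == "__main__":
    print("rows sum to total:", rows_sum == TOTAL)
    print("mismatched rows:", mismatches(by_num_capabilities) + mismatches(by_capability))
    print("counts consistent with 57.06%:", privacy_gap_candidates)
```

A check like this only confirms that the transcribed figures agree with one another; it says nothing about how the capabilities were collected.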

Original abstract

Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.
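
The abstract mentions a stratified sample of 1,500 drawn from 6,233 MedGPTs but this page does not describe the strata or the allocation rule. As a minimal sketch only, the code below shows one common approach, proportional allocation with largest-remainder rounding. The tier names and tier sizes are hypothetical placeholders, and the paper may have used a different scheme.

```python
import random


def proportional_allocation(strata_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes.

    Uses integer arithmetic and largest-remainder rounding so the
    quotas always sum exactly to sample_size.
    """
    total = sum(strata_sizes.values())
    quotas, remainders = {}, {}
    for name, size in strata_sizes.items():
        quotas[name], remainders[name] = divmod(size * sample_size, total)
    leftover = sample_size - sum(quotas.values())
    # Hand the leftover units to the strata with the largest remainders.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[name] += 1
    return quotas


def stratified_sample(ids_by_stratum, sample_size, seed=0):
    """Draw a seeded sample without replacement inside each stratum."""
    rng = random.Random(seed)
    sizes = {name: len(ids) for name, ids in ids_by_stratum.items()}
    quotas = proportional_allocation(sizes, sample_size)
    return {
        name: rng.sample(ids, quotas[name])
        for name, ids in ids_by_stratum.items()
    }


if __name__ == "__main__":
    # Hypothetical tier sizes that happen to sum to 6,233.
    tiers = {"top": 1000, "middle": 2233, "bottom": 3000}
    print(proportional_allocation(tiers, 1500))
```

Fixing the seed makes the draw repeatable, which matters when a released dataset is meant to let others re-run the same checks.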

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates risks in 6,233 web-deployed medical GPTs (MedGPTs) plus 10 open-source LLMs by analyzing a stratified sample of 1,500 MedGPTs. It introduces the MedGPT-HEval framework for hallucination detection and an LLM-based pipeline for policy violations and developer intent. Reported results include 25-30% of MedGPTs with low factual accuracy (highest in bottom- and middle-tier models), 33.6-54.3% violating operational thresholds, and 57.06% of Action-enabled models lacking adequate privacy disclosures. MedGPTs show higher factual accuracy and semantic alignment than open-source models but lower stability. The authors release the HAA-MedGPT dataset.

Significance. If the evaluation frameworks prove reliable, the work supplies large-scale empirical data on hallucination and compliance failures in real-world medical LLMs, underscoring the need for multi-metric safeguards. Strengths include the stratified sampling across tiers, release of a structured dataset for future research, and direct comparison to open-source baselines.

major comments (2)

[Methods (MedGPT-HEval framework)] Methods section describing MedGPT-HEval: the framework for hallucination detection is presented without reported human validation, inter-rater agreement metrics, expert-annotated gold standard, or ablation on prompt sensitivity. Because the headline 25-30% low factual accuracy claim and the tier-based risk comparisons rest entirely on the correctness of these automated labels, the absence of validation data is load-bearing for the central empirical results.
[Methods and Results (policy violation pipeline)] Methods and Results sections on the LLM-based policy pipeline: no external benchmarks, human review, or false-positive analysis are provided for the operational thresholds or privacy-disclosure judgments. This directly affects the reliability of the 33.6-54.3% violation rates and the 57.06% privacy-gap statistic, especially given the possibility that the judge LLM may over-flag hedging language or miss subtle breaches.

minor comments (2)

[Abstract] The abstract states both '25-30%' and specific ranges such as '33.6-54.3%' without clarifying whether the broader interval reflects different tiers or post-hoc adjustments; a single consistent reporting format would improve clarity.
[Figures and Tables] Figure captions and table legends could more explicitly define the stratification criteria (e.g., what constitutes 'bottom-tier' vs. 'middle-tier' MedGPTs) to allow readers to assess selection bias.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate additional validation for the automated frameworks, thereby strengthening the reliability of the reported empirical results.

Point-by-point responses

Referee: [Methods (MedGPT-HEval framework)] Methods section describing MedGPT-HEval: the framework for hallucination detection is presented without reported human validation, inter-rater agreement metrics, expert-annotated gold standard, or ablation on prompt sensitivity. Because the headline 25-30% low factual accuracy claim and the tier-based risk comparisons rest entirely on the correctness of these automated labels, the absence of validation data is load-bearing for the central empirical results.

Authors: We agree that explicit validation of MedGPT-HEval is essential given its role in the factual accuracy and tier-comparison results. The current manuscript describes the automated pipeline and its application to the stratified sample but does not report human validation, inter-rater metrics, or prompt ablations. In the revised manuscript we will add a new Methods subsection presenting human validation on a stratified subset of 300 responses, including inter-rater agreement from two domain experts, a released expert-annotated gold-standard sample, and prompt-sensitivity results. These additions directly address the load-bearing concern for the 25-30% low-accuracy finding. revision: yes
Referee: [Methods and Results (policy violation pipeline)] Methods and Results sections on the LLM-based policy pipeline: no external benchmarks, human review, or false-positive analysis are provided for the operational thresholds or privacy-disclosure judgments. This directly affects the reliability of the 33.6-54.3% violation rates and the 57.06% privacy-gap statistic, especially given the possibility that the judge LLM may over-flag hedging language or miss subtle breaches.

Authors: We concur that the absence of human review and false-positive analysis limits confidence in the policy-violation and privacy statistics. The manuscript currently presents the LLM judge pipeline and aggregate rates without these checks. We will revise both the Methods and Results sections to include a human audit of 200 randomly sampled judgments, reporting precision, false-positive rates (with explicit discussion of hedging language), and any missed subtle breaches. This will support the reported 33.6-54.3% violation and 57.06% privacy-gap figures. revision: yes
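
The validation steps discussed in these responses (inter-rater agreement between two experts, and precision and false-positive rate of the LLM judge against a human audit) reduce to a few standard statistics. The sketch below is one plain-Python way to compute them. The function names and the toy labels are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (observed - expected) / (1 - expected)


def audit_rates(judge_flags, human_flags):
    """Precision and false-positive rate of judge flags, with human labels as truth."""
    pairs = [(bool(j), bool(h)) for j, h in zip(judge_flags, human_flags)]
    tp = sum(j and h for j, h in pairs)
    fp = sum(j and not h for j, h in pairs)
    tn = sum((not j) and (not h) for j, h in pairs)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return {"precision": precision, "false_positive_rate": fpr}


if __name__ == "__main__":
    # Toy labels for illustration only (1 = hallucination / violation flagged).
    expert_1 = [1, 1, 0, 0]
    expert_2 = [1, 0, 0, 0]
    print(cohen_kappa(expert_1, expert_2))
    print(audit_rates([True, True, False, False, True],
                      [True, False, False, False, True]))
```

Kappa corrects raw agreement for chance, so it is more informative than percent agreement when one label dominates, as is likely if most responses are not flagged.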

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper's central results consist of empirical measurements obtained by applying the newly introduced MedGPT-HEval framework and LLM-based policy pipeline to a stratified sample of 1,500 MedGPTs plus 10 open-source models. No equations, fitted parameters, or self-citations are present that reduce the reported percentages (25-30% low factual accuracy, 33.6-54.3% operational violations, 57.06% privacy gaps) to self-defined quantities or inputs by construction. The evaluation chain relies on direct comparison of model outputs against external factual and policy criteria rather than any loop that re-derives the same quantities from the measurement process itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical measurement study; primary assumptions concern the validity of the new detection frameworks and the representativeness of the 1,500-model stratified sample drawn from 6,233 total.

axioms (1)

domain assumption: The stratified sample of 1,500 models is representative of the full population of 6,233 MedGPTs with respect to hallucination and compliance properties.
This assumption is central to generalizing the 25-30% and 33.6-54.3% figures; the stratification details are not visible in the abstract.
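
To give a rough sense of how much pure sampling error a 1,500-of-6,233 sample carries, the sketch below computes a normal-approximation margin of error for a proportion with a finite-population correction. It assumes simple random sampling, which is a simplification of a stratified design, and it covers sampling error only, not labelling error from the automated frameworks. The function name and the 27.5% midpoint are illustrative choices.

```python
import math


def proportion_margin(p, n, population, z=1.96):
    """Margin of error for a sample proportion, with finite-population correction.

    p          -- observed proportion (0..1)
    n          -- sample size
    population -- size of the population the sample was drawn from
    z          -- normal quantile (1.96 for roughly 95% confidence)
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    fpc = math.sqrt((population - n) / (population - 1))
    return z * standard_error * fpc


if __name__ == "__main__":
    # Midpoint of the reported 25-30% range, used purely as an example input.
    moe = proportion_margin(0.275, 1500, 6233)
    print(f"margin of error: +/- {100 * moe:.2f} percentage points")
```

Under these assumptions the margin comes out at roughly two percentage points, so sampling error alone is small relative to the reported rates; the larger uncertainty is whether the automated labels are correct.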

pith-pipeline@v0.9.0 · 5758 in / 1250 out tokens · 37790 ms · 2026-05-21T05:46:56.608489+00:00 · methodology


Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 3 internal anchors

[1]

Readability of custom chatbot vs. gpt-4 responses to otolaryngology-related patient questions,

Y . Alsabawi, P. R. Quesada, and D. T. Rouse, “Readability of custom chatbot vs. gpt-4 responses to otolaryngology-related patient questions,”American Journal of Otolaryngology, vol. 46, no. 5, p. 104717, 2025. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0196070925001206

work page 2025
[2]

OpenAI GPT store,

OpenAI, “OpenAI GPT store,” 2025. [Online]. Available: https: //openai.com/index/introducing-the-gpt-store/

work page 2025
[3]

Gptracker: A large-scale measurement of misused gpts,

X. Shen, Y . Shen, M. Backes, and Y . Zhang, “Gptracker: A large-scale measurement of misused gpts,” in2025 IEEE Symposium on Security and Privacy (SP), 2025, pp. 336–354

work page 2025
[4]

Unsafe by design? a first look at security and privacy risks in openai’s custom gpt ecosystem,

S. O. Ogundoyin, M. Ikram, H. J. Asghar, B. Z. H. Zhao, and D. Kaafar, “Unsafe by design? a first look at security and privacy risks in openai’s custom gpt ecosystem,” in2025 Workshop on Privacy in the Electronic Society (WPES ’25), October 13–17, 2025, Taipei, Taiwan. ACM, New York, NY, USA, 2025

work page 2025
[5]

Galactica: A Large Language Model for Science

R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V . Kerkez, and R. Stojnic, “Galactica: A large language model for science,”arXiv preprint arXiv:2211.09085, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Pmc-llama: Towards building open-source language models for medicine,

C. Wu, W. Lin, X. Zhang, Y . Zhang, Y . Wang, and W. Xie, “Pmc-llama: Towards building open-source language models for medicine,” 2023. [Online]. Available: https://arxiv.org/abs/2304.14454

work page arXiv 2023
[7]

Medalpaca – an open-source collection of medical conversational ai models and training data,

T. Han, L. C. Adams, J.-M. Papaioannou, P. Grundmann, T. Oberhauser, A. Figueroa, A. L ¨oser, D. Truhn, and K. K. Bressem, “Medalpaca – an open-source collection of medical conversational ai models and training data,” 2025. [Online]. Available: https://arxiv.org/abs/2304.08247

work page arXiv 2025
[9]

A primer on large language models (llms) and chatgpt for cardiovascular healthcare professionals,

M. Ahmed, J. Lam, A. Chow, and C.-M. Chow, “A primer on large language models (llms) and chatgpt for cardiovascular healthcare professionals,”CJC Open, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2589790X25001106

work page 2025
[10]

Evaluating chatgpt, gemini and other large language models (llms) in orthopaedic diagnostics: A prospective clinical study,

S. Pagano, L. Strumolo, K. Michalk, J. Schiegl, L. C. Pulido, J. Reinhard, G. Maderbacher, T. Renkawitz, and M. Schuster, “Evaluating chatgpt, gemini and other large language models (llms) in orthopaedic diagnostics: A prospective clinical study,”Computational and Structural Biotechnology Journal, vol. 28, pp. 9–15, 2025. [Online]. Available: https://www....

work page 2025
[11]

A Survey on Hallucination in Large Language Models

L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and T. Liu, “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,”ACM Transactions on Information Systems, vol. 43, no. 2, p. 1–55, Jan. 2025. [Online]. Available: http://dx.doi.org/10.1145/3703155

work page doi:10.1145/3703155 2025
[12]

Hallulens: Llm hallucination benchmark,

Y . Bang, Z. Ji, A. Schelten, A. Hartshorn, T. Fowler, C. Zhang, N. Cancedda, and P. Fung, “Hallulens: Llm hallucination benchmark,”

work page
[13]

Available: https://arxiv.org/abs/2504.17550

[Online]. Available: https://arxiv.org/abs/2504.17550

work page arXiv
[14]

Towards safer chatbots: A framework for policy compliance evaluation of custom GPTs,

D. Rodriguez, W. Seymour, J. M. D. Alamo, and J. Such, “Towards safer chatbots: A framework for policy compliance evaluation of custom GPTs,” 2025. [Online]. Available: https://arxiv.org/abs/2502.01436

work page arXiv 2025
[15]

On the (In)Security of LLM App Stores,

X. Hou, Y . Zhao, and H. Wang, “On the (In)Security of LLM App Stores,” Jul. 2024, arXiv:2407.08422 [cs]. [Online]. Available: http://arxiv.org/abs/2407.08422

work page arXiv 2024
[18]

Large language models encode clinical knowledge,

S. Karan, A. Shekoofeh, T. Tao, M. S. Sara, W. Jason, W. C. Hyung, S. Nathan, T. Ajay, C.-L. Heather, P. Stephen, P. Perry, S. Martin, G. Paul, K. Chris, B. Abubakr, S. Nathanael, C. Aakanksha, M. Philip, D.-F. Dina, A. y. A. Blaise, W. Dale, S. C. Greg, M. Yossi, C. Katherine, G. Juraj, T. Nenad, L. Yun, R. Alvin, B. Joelle, S. Christopher, K. Alan, and ...

work page doi:10.1038/s41586-023-06291-2 2023
[19]

Gpt-4 technical report,

OpenAI, J. Achiam, S. Adler, and S. A. et al., “Gpt-4 technical report,”

work page
[20]

GPT-4 Technical Report

[Online]. Available: https://arxiv.org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv
[21]

A framework to assess clinical safety and hallucination rates of llms for medical text summarisation.npj Digital Medicine, 8:274, 2025

E. Asgari, N. Monta ˜na-Brown, M. Dubois, S. Khalil, J. Balloch, J. A. Yeung, and D. Pimenta, “A framework to assess clinical safety and hallucination rates of llms for medical text summarisation,”npj Digital Medicine, vol. 8, no. 1, p. 274, May 2025. [Online]. Available: https://doi.org/10.1038/s41746-025-01670-7

work page doi:10.1038/s41746-025-01670-7 2025
[22]

GPTs window shopping: An analysis of the landscape of custom ChatGPT models,

B. Z. H. Zhao, M. Ikram, and M. A. Kaafar, “GPTs window shopping: An analysis of the landscape of custom ChatGPT models,” 2024. [Online]. Available: https://arxiv.org/abs/2405.10547

work page arXiv 2024
[23]

GPTApps.io, https://gptsapp.io/trending-gpts/top-1000-gpts-ranked

work page
[24]

Gptstore.ai,

GPTStore.AI, “Gptstore.ai,” 2025. [Online]. Available: https://gptstore. ai/

work page 2025
[25]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams,

D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits, “What disease does this patient have? a large-scale open domain question answering dataset from medical exams,”Applied Sciences, vol. 11, no. 14, 2021. [Online]. Available: https://www.mdpi.com/ 2076-3417/11/14/6421

work page 2021
[26]

G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment,

L. Yang, I. Dan, X. Yichong, W. Shuohang, X. Ruochen, and Z. Chen- guang, “G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment,” in2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 2511–2522

work page 2023
[27]

Bartscore: evaluating generated text as text generation,

W. Yuan, G. Neubig, and P. Liu, “Bartscore: evaluating generated text as text generation,” inProceedings of the 35th International Conference on Neural Information Processing Systems, ser. NIPS ’21. Red Hook, NY , USA: Curran Associates Inc., 2021

work page 2021
[28]

Measuring semantic entropy,

I. D. Melamed, “Measuring semantic entropy,” inTagging Text with Lexical Semantics: Why, What, and How?, 1997. [Online]. Available: https://aclanthology.org/W97-0207/

work page 1997
[29]

Chapter 4 - classification,

V . Kotu and B. Deshpande, “Chapter 4 - classification,” inData Science (Second Edition), second edition ed., V . Kotu and B. Deshpande, Eds. Morgan Kaufmann, 2019, pp. 65–163. [Online]. Available: https: //www.sciencedirect.com/science/article/pii/B9780128147610000046

work page 2019
[30]

A primer on large language models (llms) and chatgpt for cardiovascular healthcare professionals,

M. Ahmed, J. Lam, A. Chow, and C.-M. Chow, “A primer on large language models (llms) and chatgpt for cardiovascular healthcare professionals,”CJC Open, vol. 7, no. 5, pp. 660–666, Dec. 2025. [Online]. Available: https://doi.org/10.1016/j.cjco.2025.02.012

work page doi:10.1016/j.cjco.2025.02.012 2025
[31]

Use of a large language model (llm) for ambulance dispatch and triage,

A. C. Shekhar, J. Kimbrell, A. Saharan, J. Stebel, E. Ashley, and E. E. Abbott, “Use of a large language model (llm) for ambulance dispatch and triage,”The American Journal of Emergency Medicine, vol. 89, pp. 27–29, 2025. [Online]. Available: https: //www.sciencedirect.com/science/article/pii/S0735675724007150

work page 2025
[32]

Assessment of large language models (llms) in decision-making support for gynecologic oncology,

K. E. Gumilar, B. R. Indraprasta, A. S. Faridzi, B. M. Wibowo, A. Herlambang, E. Rahestyningtyas, B. Irawan, Z. Tambunan, A. F. Bustomi, B. N. Brahmantara, Z.-Y . Yu, Y .-C. Hsu, H. Pramuditya, V . G. E. Putra, H. Nugroho, P. Mulawardhana, B. A. Tjokroprawiro, T. Hedianto, I. H. Ibrahim, J. Huang, D. Li, C.-H. Lu, J.-Y . Yang, L.-N. Liao, and M. Tan, “Ass...

work page 2024
[34]

The application of large language models in medicine: A scoping review,

X. Meng, X. Yan, K. Zhang, D. Liu, X. Cui, Y . Yang, M. Zhang, C. Cao, J. Wang, X. Wang, J. Gao, Y .-G.-S. Wang, J. ming Ji, Z. Qiu, M. Li, C. Qian, T. Guo, S. Ma, Z. Wang, Z. Guo, Y . Lei, C. Shao, W. Wang, H. Fan, and Y .-D. Tang, “The application of large language models in medicine: A scoping review,” iScience, vol. 27, no. 5, p. 109713, 2024. [Online...

work page 2024
[35]

Health-llm: Personalized retrieval- augmented disease prediction system,

Q. Yu, M. Jin, D. Shu, C. Zhang, L. Fan, W. Hua, S. Zhu, Y . Meng, Z. Wang, M. Du, and Y . Zhang, “Health-llm: Personalized retrieval- augmented disease prediction system,” 2025. [Online]. Available: https://arxiv.org/abs/2402.00746

work page arXiv 2025
[36]

Polaris: A safety-focused llm constellation architecture for healthcare,

S. Mukherjee, P. Gamble, M. S. Ausin, N. Kant, K. Aggarwal, N. Manjunath, D. Datta, Z. Liu, J. Ding, S. Busacca, C. Bianco, S. Sharma, R. Lasko, M. V oisard, S. Harneja, D. Filippova, G. Meixiong, K. Cha, A. Youssefi, M. Buvanesh, H. Weingram, S. Bierman-Lytle, H. S. Mangat, K. Parikh, S. Godil, and A. Miller, “Polaris: A safety-focused llm constellation ...

work page
[37]

Available: https://arxiv.org/abs/2403.13313

[Online]. Available: https://arxiv.org/abs/2403.13313

work page arXiv
[38]

Medical hallucinations in foundation models and their impact on healthcare,

Y . Kim, H. Jeong, S. Chen, S. S. Li, M. Lu, K. Alhamoud, J. Mun, C. Grau, M. Jung, R. Gameiro, L. Fan, E. Park, T. Lin, J. Yoon, W. Yoon, M. Sap, Y . Tsvetkov, P. Liang, X. Xu, X. Liu, D. McDuff, H. Lee, H. W. Park, S. Tulebaev, and C. Breazeal, “Medical hallucinations in foundation models and their impact on healthcare,”

work page
[39]

Available: https://arxiv.org/abs/2503.05777

[Online]. Available: https://arxiv.org/abs/2503.05777

work page arXiv
[40]

Faithfulness hallucination detection in healthcare AI,

P. R. Vishwanath, S. Tiwari, T. G. Naik, S. Gupta, D. N. Thai, W. Zhao, S. KWON, V . Ardulov, K. Tarabishy, A. McCallum, and W. Salloum, “Faithfulness hallucination detection in healthcare AI,” in Artificial Intelligence and Data Science for Healthcare: Bridging Data- Centric AI and People-Centric Healthcare, 2024. [Online]. Available: https://openreview....

work page 2024
[41]

Listening to patients: A framework of detecting and mitigating patient misreport for medical dialogue generation,

L. Qin, Y . Zhang, H. Liang, A. Jatowt, and Z. Yang, “Listening to patients: A framework of detecting and mitigating patient misreport for medical dialogue generation,” 2024. [Online]. Available: https://arxiv.org/abs/2410.06094

work page arXiv 2024
[42]

Can we trust AI doctors? a survey of medical hallucination in large language and large vision-language models,

Z. Zhu, Y . Zhang, X. Zhuang, F. Zhang, Z. Wan, Y . Chen, Q. QingqingLong, Y . Zheng, and X. Wu, “Can we trust AI doctors? a survey of medical hallucination in large language and large vision-language models,” inFindings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: As...

work page 2025
[43]

Medhalu: Hallucinations in responses to healthcare queries by large language models,

V . Agarwal, Y . Jin, M. Chandra, M. D. Choudhury, S. Kumar, and N. Sastry, “Medhalu: Hallucinations in responses to healthcare queries by large language models,” 2025. [Online]. Available: https://arxiv.org/abs/2409.19492

work page arXiv 2025
[44]

A first look at gpt apps: Landscape and vulnerability,

Z. Zhang, L. Zhang, X. Yuan, A. Zhang, M. Xu, and F. Qian, “A first look at gpt apps: Landscape and vulnerability,” 2024. [Online]. Available: https://arxiv.org/abs/2402.15105

work page arXiv 2024
[45]

Hallucinations: Clinical aspects and management

C. S, “Hallucinations: Clinical aspects and management.”Industrial psychiatry journal, vol. 19, no. 1, pp. 5–12, 2010. [Online]. Available: https://doi.org/10.4103/0972-6748.77625

work page doi:10.4103/0972-6748.77625 2010
[46]

Addressing cognitive bias in medical language models.arXiv preprint arXiv:2402.08113,

S. Schmidgall, C. Harris, I. Essien, D. Olshvang, T. Rahman, J. W. Kim, R. Ziaei, J. Eshraghian, P. Abadir, and R. Chellappa, “Addressing cognitive bias in medical language models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.08113

work page arXiv 2024
[47]

Med-halt: Medical domain hallucination test for large language models,

A. Pal, L. K. Umapathi, and M. Sankarasubbu, “Med-halt: Medical domain hallucination test for large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2307.15343

work page arXiv 2023
[48]

A survey of large language models in medicine: Progress, application, and challenge,

H. Zhou, F. Liu, B. Gu, X. Zou, J. Huang, J. Wu, Y . Li, S. S. Chen, P. Zhou, J. Liu, Y . Hua, C. Mao, C. You, X. Wu, Y . Zheng, L. Clifton, Z. Li, J. Luo, and D. A. Clifton, “A survey of large language models in medicine: Progress, application, and challenge,” 2024. [Online]. Available: https://arxiv.org/abs/2311.05112

work page arXiv 2024
[49]

A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics,

K. He, R. Mao, Q. Lin, Y . Ruan, X. Lan, M. Feng, and E. Cambria, “A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics,” Information Fusion, vol. 118, p. 102963, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1566253525000363

work page 2025
[50]

A survey on medical large language models: Technology, application, trustworthiness, and future directions,

L. Liu, X. Yang, J. Lei, X. Liu, Y . Shen, Z. Zhang, P. Wei, J. Gu, Z. Chu, Z. Qin, and K. Ren, “A survey on medical large language models: Technology, application, trustworthiness, and future directions,”ArXiv, vol. abs/2406.03712, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:270285974

work page arXiv 2024
[51]

Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge,

Y . Li, Z. Li, K. Zhang, R. Dan, S. Jiang, and Y . Zhang, “Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge,” 2023. [Online]. Available: https://arxiv.org/abs/2303.14070

work page arXiv 2023
[52]

Jmlr: Joint medical llm and retrieval training for enhancing reasoning and professional question answering capability,

J. Wang, Z. Yang, Z. Yao, and H. Yu, “Jmlr: Joint medical llm and retrieval training for enhancing reasoning and professional question answering capability,”arXiv preprint arXiv:2402.17887, 2024

work page arXiv 2024
[53]

BioMistral: A collection of open-source pretrained large language models for medical domains,

Y . Labrak, A. Bazoge, E. Morin, P.-A. Gourraud, M. Rouvier, and R. Dufour, “BioMistral: A collection of open-source pretrained large language models for medical domains,” 2024

work page 2024
[54]

Apollo: Lightweight multilingual medical llms towards democratizing medical ai to 6b people,

X. Wang, N. Chen, J. Chen, Y . Hu, Y . Wang, X. Wu, A. Gao, X. Wan, H. Li, and B. Wang, “Apollo: Lightweight multilingual medical llms towards democratizing medical ai to 6b people,” 2024

work page 2024
[55]

Aloe: A family of fine-tuned open healthcare llms,

A. K. Gururajan, E. Lopez-Cuena, J. Bayarri-Planas, A. Tormos, D. Hin- jos, P. Bernabeu-Perez, A. Arias-Duart, P. A. Martin-Torres, L. Urcelay- Ganzabal, M. Gonzalez-Mallo, S. Alvarez-Napagao, E. Ayguad ´e-Parra, and U. C. D. Garcia-Gasulla, “Aloe: A family of fine-tuned open healthcare llms,” 2024

work page 2024
[56]

Mental health therapy chatbot,

Tanusri, “Mental health therapy chatbot,” 2024. [Online]. Available: https://huggingface.co/tanusrich/Mental Health Chatbot

work page 2024
[57]

Zhongjing: Enhancing the chinese medical capabilities of large language model through expert feedback and real-world multi-turn dialogue,

S. Yang, H. Zhao, S. Zhu, G. Zhou, H. Xu, Y . Jia, and H. Zan, “Zhongjing: Enhancing the chinese medical capabilities of large language model through expert feedback and real-world multi-turn dialogue,” 2023. [Online]. Available: https://arxiv.org/abs/2308.03549

work page arXiv 2023
[58]

Selenium webdriver,

Selenium Project, “Selenium webdriver,” https://www.selenium.dev, 2024, accessed: 2025-04-15

work page 2024
[59]

Biobart: Pretraining and evaluation of a biomedical generative language model,

H. Yuan, Z. Yuan, R. Gan, J. Zhang, Y . Xie, and S. Yu, “Biobart: Pretraining and evaluation of a biomedical generative language model,” 2022

work page 2022
[60]

Farquhar, J

S. Farquhar, J. Kossen, L. Kuhn, and Y . Gal, “Detecting hallucinations in large language models using semantic entropy,”Nature, vol. 630, no. 8017, pp. 625–630, 2024. [Online]. Available: https: //doi.org/10.1038/s41586-024-07421-0

work page doi:10.1038/s41586-024-07421-0 2024
[61]

Privacy policy,

OpenAI, “Privacy policy,” June 27, 2025. [Online]. Available: https://openai.com/policies/row-privacy-policy/

work page 2025
[62]

A Survey on LLM-as-a-Judge

J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y . Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Y . Wang, W. Gao, L. Ni, and J. Guo, “A survey on llm-as-a-judge,” 2025. [Online]. Available: https://arxiv.org/abs/2411.15594

work page internal anchor Pith review Pith/arXiv arXiv 2025
[63]

Gemini 3.1 Pro: A smarter model for your most complex tasks,

“Gemini 3.1 Pro: A smarter model for your most complex tasks,” Feb. 2026. [Online]. Available: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/

[64]

Theoretical foundations and mitigation of hallucination in large language models,

E. Gumaan, “Theoretical foundations and mitigation of hallucination in large language models,” 2025. [Online]. Available: https://arxiv.org/abs/2507.22915

[65]

A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation

N. Varshney, W. Yao, H. Zhang, J. Chen, and D. Yu, “A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation,” 2023. [Online]. Available: https://arxiv.org/abs/2307.03987

[66]

RARR: Researching and revising what language models say, using language models,

L. Gao, Z. Dai, P. Pasupat, A. Chen, A. T. Chaganty, Y. Fan, V. Zhao, N. Lao, H. Lee, D.-C. Juan, and K. Guu, “RARR: Researching and revising what language models say, using language models,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. ...

[67]

Creating trustworthy llms: Dealing with hallucinations in healthcare ai,

M. A. Ahmad, I. Yaramis, and T. D. Roy, “Creating trustworthy llms: Dealing with hallucinations in healthcare ai,” 2023. [Online]. Available: https://arxiv.org/abs/2311.01463

TABLE IX: Operationalized policies and context for forbidden scenarios.
Proscribed case; Context; Policy description
Health consultation Using these models to advise someone on possibl...

