Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models
Pith reviewed 2026-05-21 05:46 UTC · model grok-4.3
The pith
A large-scale assessment finds that 25-30 percent of sampled custom medical GPTs (MedGPTs) exhibit low factual accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using the MedGPT-HEval framework to score factual accuracy and an LLM pipeline to flag policy violations, the study finds that 25-30 percent of the sampled MedGPTs display low factual accuracy, with bottom- and middle-tier models at highest risk, while 33.6-54.3 percent breach operational thresholds and 57.06 percent of action-enabled models lack adequate privacy disclosures. MedGPTs score higher on factual accuracy and semantic alignment than the open-source models, yet the open-source models prove more stable overall.
What carries the argument
MedGPT-HEval framework for hallucination detection together with an LLM-based pipeline that scores policy violations and developer intent.
If this is right
- Platforms hosting medical models would need routine multi-metric checks before allowing public access.
- Developers of lower-tier medical GPTs would face pressure to improve accuracy or add clear disclaimers.
- Action-enabled models would require mandatory privacy disclosures to meet basic safety standards.
- Accuracy and stability would need to be tested separately, since MedGPTs lead on accuracy while open-source models are more stable.
Where Pith is reading between the lines
- Clinicians using these tools for quick lookups might still need to cross-check outputs against established medical references.
- Public web platforms could introduce warning labels or usage limits for models that fail the accuracy thresholds.
- Extending the same checks to live patient dialogues rather than static prompts could expose additional real-world failure modes.
Load-bearing premise
The two evaluation frameworks correctly flag hallucinations and policy breaches in the sampled models without large numbers of false positives or selection bias.
What would settle it
An independent team manually reviewing the same 1,500 models and reporting substantially lower rates of factual errors or policy violations.
Original abstract
Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.
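The sampling step described in the abstract (1,500 MedGPTs drawn from 6,233 by strata) can be sketched as follows. This is a minimal illustration that assumes proportional allocation across three tiers. The tier names, the tier sizes, and the `stratified_sample` helper are hypothetical, since the abstract does not state the paper's actual strata or allocation rule.

```python
import random


def stratified_sample(population, sample_size, seed=0):
    """Draw a proportionally allocated sample from a dict of tier -> list of ids."""
    rng = random.Random(seed)
    total = sum(len(ids) for ids in population.values())
    # Ideal (fractional) quota per tier under proportional allocation.
    exact = {tier: sample_size * len(ids) / total for tier, ids in population.items()}
    quota = {tier: int(x) for tier, x in exact.items()}
    # Largest-remainder rounding so the quotas sum exactly to sample_size.
    leftover = sample_size - sum(quota.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - quota[t], reverse=True)
    for tier in by_remainder[:leftover]:
        quota[tier] += 1
    # Sample without replacement inside each tier.
    return {tier: rng.sample(ids, quota[tier]) for tier, ids in population.items()}


# Hypothetical tier sizes that sum to 6,233; not taken from the paper.
population = {
    "top": [f"top-{i}" for i in range(1000)],
    "middle": [f"mid-{i}" for i in range(2233)],
    "bottom": [f"bot-{i}" for i in range(3000)],
}
sample = stratified_sample(population, 1500)
sizes = {tier: len(ids) for tier, ids in sample.items()}
```

Proportional allocation keeps each tier's share of the sample equal to its share of the population, which is one common way to make tier-level comparisons less sensitive to selection bias. Other allocation rules (for example equal-sized strata) would be equally plausible readings of "stratified".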
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates risks in 6,233 web-deployed medical GPTs (MedGPTs) plus 10 open-source LLMs by analyzing a stratified sample of 1,500 MedGPTs. It introduces the MedGPT-HEval framework for hallucination detection and an LLM-based pipeline for policy violations and developer intent. Reported results include 25-30% of MedGPTs with low factual accuracy (highest in bottom- and middle-tier models), 33.6-54.3% violating operational thresholds, and 57.06% of Action-enabled models lacking adequate privacy disclosures. MedGPTs show higher factual accuracy and semantic alignment than open-source models but lower stability. The authors release the HAA-MedGPT dataset.
Significance. If the evaluation frameworks prove reliable, the work supplies large-scale empirical data on hallucination and compliance failures in real-world medical LLMs, underscoring the need for multi-metric safeguards. Strengths include the stratified sampling across tiers, release of a structured dataset for future research, and direct comparison to open-source baselines.
major comments (2)
- [Methods (MedGPT-HEval framework)] Methods section describing MedGPT-HEval: the framework for hallucination detection is presented without reported human validation, inter-rater agreement metrics, expert-annotated gold standard, or ablation on prompt sensitivity. Because the headline 25-30% low factual accuracy claim and the tier-based risk comparisons rest entirely on the correctness of these automated labels, the absence of validation data is load-bearing for the central empirical results.
- [Methods and Results (policy violation pipeline)] Methods and Results sections on the LLM-based policy pipeline: no external benchmarks, human review, or false-positive analysis are provided for the operational thresholds or privacy-disclosure judgments. This directly affects the reliability of the 33.6-54.3% violation rates and the 57.06% privacy-gap statistic, especially given the possibility that the judge LLM may over-flag hedging language or miss subtle breaches.
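The false-positive analysis requested in the second major comment could be computed along the following lines. This is a sketch only: the `audit_metrics` helper and the toy labels are hypothetical, and it assumes each judgment reduces to a binary flag (True = violation) that a human reviewer has independently labeled.

```python
def audit_metrics(judge_flags, human_flags):
    """Compare binary LLM-judge flags against human labels (True = violation)."""
    if len(judge_flags) != len(human_flags):
        raise ValueError("label lists must have equal length")
    pairs = list(zip(judge_flags, human_flags))
    tp = sum(1 for j, h in pairs if j and h)          # judge flagged, human agrees
    fp = sum(1 for j, h in pairs if j and not h)      # judge over-flagged
    fn = sum(1 for j, h in pairs if not j and h)      # judge missed a breach
    tn = sum(1 for j, h in pairs if not j and not h)  # both say compliant
    nan = float("nan")
    return {
        "precision": tp / (tp + fp) if tp + fp else nan,
        "recall": tp / (tp + fn) if tp + fn else nan,
        "false_positive_rate": fp / (fp + tn) if fp + tn else nan,
    }


# Toy labels for illustration only; not data from the paper.
judge = [True, True, True, False, False, False, True, False]
human = [True, False, True, False, True, False, True, False]
metrics = audit_metrics(judge, human)
```

Reporting recall alongside precision matters here because the referee raises both failure directions: over-flagging hedging language (false positives) and missing subtle breaches (false negatives).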
minor comments (2)
- [Abstract] The abstract states both '25-30%' and specific ranges such as '33.6-54.3%' without clarifying whether the broader interval reflects different tiers or post-hoc adjustments; a single consistent reporting format would improve clarity.
- [Figures and Tables] Figure captions and table legends could more explicitly define the stratification criteria (e.g., what constitutes 'bottom-tier' vs. 'middle-tier' MedGPTs) to allow readers to assess selection bias.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate additional validation for the automated frameworks, thereby strengthening the reliability of the reported empirical results.
Point-by-point responses
- Referee: [Methods (MedGPT-HEval framework)] Methods section describing MedGPT-HEval: the framework for hallucination detection is presented without reported human validation, inter-rater agreement metrics, expert-annotated gold standard, or ablation on prompt sensitivity. Because the headline 25-30% low factual accuracy claim and the tier-based risk comparisons rest entirely on the correctness of these automated labels, the absence of validation data is load-bearing for the central empirical results.
  Authors: We agree that explicit validation of MedGPT-HEval is essential given its role in the factual accuracy and tier-comparison results. The current manuscript describes the automated pipeline and its application to the stratified sample but does not report human validation, inter-rater metrics, or prompt ablations. In the revised manuscript we will add a new Methods subsection presenting human validation on a stratified subset of 300 responses, including inter-rater agreement from two domain experts, a released expert-annotated gold-standard sample, and prompt-sensitivity results. These additions directly address the load-bearing concern for the 25-30% low-accuracy finding. Revision: yes.
- Referee: [Methods and Results (policy violation pipeline)] Methods and Results sections on the LLM-based policy pipeline: no external benchmarks, human review, or false-positive analysis are provided for the operational thresholds or privacy-disclosure judgments. This directly affects the reliability of the 33.6-54.3% violation rates and the 57.06% privacy-gap statistic, especially given the possibility that the judge LLM may over-flag hedging language or miss subtle breaches.
  Authors: We concur that the absence of human review and false-positive analysis limits confidence in the policy-violation and privacy statistics. The manuscript currently presents the LLM judge pipeline and aggregate rates without these checks. We will revise both the Methods and Results sections to include a human audit of 200 randomly sampled judgments, reporting precision, false-positive rates (with explicit discussion of hedging language), and any missed subtle breaches. This will support the reported 33.6-54.3% violation and 57.06% privacy-gap figures. Revision: yes.
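The inter-rater agreement promised in the first response is commonly reported as Cohen's kappa for two raters. The rebuttal does not name the statistic, so the choice of kappa, the `cohens_kappa` helper, and the toy labels below are all assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one categorical label per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels from two hypothetical domain experts; not data from the paper.
expert_1 = ["ok", "ok", "halluc", "halluc", "ok", "halluc", "ok", "ok", "halluc", "ok"]
expert_2 = ["ok", "ok", "halluc", "ok", "ok", "halluc", "ok", "halluc", "halluc", "ok"]
kappa = cohens_kappa(expert_1, expert_2)
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than a plain percentage when one label (for example "ok") dominates the validation subset.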
Circularity Check
No significant circularity in the derivation chain
The paper's central results consist of empirical measurements obtained by applying the newly introduced MedGPT-HEval framework and LLM-based policy pipeline to a stratified sample of 1,500 MedGPTs plus 10 open-source models. No equations, fitted parameters, or self-citations are present that reduce the reported percentages (25-30% low factual accuracy, 33.6-54.3% operational violations, 57.06% privacy gaps) to self-defined quantities or inputs by construction. The evaluation chain relies on direct comparison of model outputs against external factual and policy criteria rather than any loop that re-derives the same quantities from the measurement process itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The stratified sample of 1,500 models is representative of the full population of 6,233 MedGPTs for hallucination and compliance properties.
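If the representativeness assumption holds, the sampling uncertainty on a headline proportion can be bounded roughly. The sketch below uses a normal-approximation margin of error with a finite population correction. It assumes simple random sampling, which is a simplification of the stratified design, and the `proportion_margin` helper is hypothetical.

```python
import math


def proportion_margin(p, n, population, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Assumes simple random sampling without replacement from a finite population.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    # Finite population correction: the sample covers a sizeable share of 6,233.
    fpc = math.sqrt((population - n) / (population - 1))
    return z * standard_error * fpc


# Upper end of the reported 25-30% low-accuracy range, n = 1,500 of N = 6,233.
margin = proportion_margin(0.30, 1500, 6233)
```

Under these assumptions the margin comes to about two percentage points. It covers sampling error only: mislabeling by the automated evaluators, which is the referee's main concern, is a separate source of error that this calculation does not capture.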
TABLE IX: Operationalized policies and context for forbidden scenarios. Proscribed case Context Policy description Health consultation Using these models to advise someone on possibl...