A Survey of Large Language Models in Medicine: Progress, Application, and Challenge

Boyang Gu; Chengfeng Mao; Chenyu You; David A. Clifton; Fenglin Liu; Hongjian Zhou; Jiebo Luo; Jinfa Huang; Jinge Wu; Junling Liu

arxiv: 2311.05112 · v7 · pith:F3N5ZPXXnew · submitted 2023-11-09 · 💻 cs.CL · cs.AI


Hongjian Zhou, Fenglin Liu, Boyang Gu, Xinyu Zou, Jinfa Huang, Jinge Wu, Yiru Li, Sam S. Chen, Peilin Zhou, Junling Liu, Yining Hua, Chengfeng Mao, Chenyu You, Xian Wu, Yefeng Zheng, Lei Clifton, Zheng Li, Jiebo Luo, David A. Clifton


classification 💻 cs.CL cs.AI

keywords: llms, medical, medicine, development, review, language, models, practical


Large language models (LLMs), such as ChatGPT, have received substantial attention due to their capabilities for understanding and generating human language. While a growing body of research applies LLMs to medical tasks (e.g., enhancing clinical diagnostics and providing medical education), reviews of these efforts, particularly their development, practical applications, and outcomes in medicine, remain scarce. Therefore, this review aims to provide a detailed overview of the development and deployment of LLMs in medicine, including the challenges and opportunities they face. In terms of development, we provide a detailed introduction to the principles of existing medical LLMs, including their basic model structures, number of parameters, and the sources and scales of data used for model development. This introduction serves as a guide for practitioners developing medical LLMs tailored to their specific needs. In terms of deployment, we compare the performance of different LLMs across various medical tasks, and further compare them with state-of-the-art lightweight models, aiming to provide an understanding of the advantages and limitations of LLMs in medicine. Overall, in this review, we address the following questions: 1) What are the practices for developing medical LLMs? 2) How can the performance of LLMs on medical tasks be measured? 3) How have medical LLMs been employed in real-world practice? 4) What challenges arise from the use of medical LLMs? and 5) How can medical LLMs be developed and deployed more effectively? By answering these questions, this review aims to provide insights into the opportunities for LLMs in medicine and serve as a practical resource. We also maintain a regularly updated list of practical guides on medical LLMs at https://github.com/AI-in-Health/MedLLMsPracticalGuide



Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

Evaluation of 6233 MedGPTs finds 25-30% with low factual accuracy, 33.6-54.3% violating operational thresholds, and 57% of action-enabled models lacking privacy disclosures.
CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
cs.CL 2026-05 unverdicted novelty 6.0

CLR-voyance reformulates inpatient reasoning as a POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
Whole-body CT attenuation and volume charts from routine clinical scans via evidence-grounded LLM report filtering
cs.CV 2026-05 unverdicted novelty 6.0

An evidence-grounded LLM ensemble filters pathology from clinical radiology reports to create large healthy cohorts and build age-, sex-, and contrast-adjusted reference charts for 106 CT structures using generalized ...
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
cs.AI 2026-04 unverdicted novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
cs.CR 2026-04 unverdicted novelty 6.0

CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews
cs.CL 2026-03 unverdicted novelty 6.0

EviSearch automates ontology-aligned clinical evidence table creation from native PDFs with comprehensive provenance logging for auditability and iterative improvement.
Safe Autoregressive Image Generation with Iterative Self-Improving Codebooks
cs.CV 2026-06 unverdicted novelty 5.0

Iterative self-improving codebooks enhance safety in autoregressive multimodal models by self-identifying unsafe generations and updating the codebook to eliminate harmful visual token mappings without external feedback.
Systematic Evaluation of Large Language Models for Post-Discharge Clinical Action Extraction
cs.AI 2026-05 unverdicted novelty 5.0

LLMs match or beat supervised BERT models on detecting whether a discharge note contains an actionable clinical task but trail on classifying the exact type of action, pointing to the need for datasets that explain wh...
UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA
cs.CV 2026-06 unverdicted novelty 4.0

UniReason-Med introduces a unified framework for 2D and 3D medical VQA with shared grounded reasoning, trained on a 220K dataset, claiming that joint 2D+3D supervision improves 3D performance over 3D-only training.
QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model
cs.CL 2025-04 unverdicted novelty 4.0

QM-ToT applies Tree of Thoughts decomposition and evaluator layers to quantized LLMs, reporting accuracy gains from 34% to 50% on MedQA-USMLE for LLaMA2-70b and from 58.77% to 69.49% for LLaMA-3.1-8b, plus an 86.27% im...
Development and Preliminary Evaluation of a Domain-Specific Large Language Model for Tuberculosis Care in South Africa
cs.CL 2026-03 unverdicted novelty 3.0

A domain-specific LLM for TB care in South Africa, created by fine-tuning BioMistral-7B with QLoRA and GraphRAG on local guidelines, shows improved contextual alignment over the base model.
Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module
cs.CV 2025-03 unverdicted novelty 3.0

CA-TriNet combines co-attention transformers with a triple-LSTM module for medical report generation and reports outperforming prior models on three public datasets.
A Survey on the Memory Mechanism of Large Language Model based Agents
cs.AI 2024-04 accept novelty 3.0

A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.