CLIN-LLM: A Safety-Constrained Hybrid Framework for Clinical Diagnosis and Treatment Generation

Bikash Kumar Paul; Farman Hossain Sayem; Md. Abir Hossain; Md. Mehedi Hasan; Mohammad Shorif Uddin; Rafid Mostafiz; Ziaur Rahman

arXiv: 2510.22609 · v2 · submitted 2025-10-26 · cs.AI

Pith reviewed 2026-05-18 04:29 UTC · model grok-4.3

classification: cs.AI

keywords: clinical diagnosis, treatment recommendation, uncertainty calibration, retrieval augmented generation, hybrid AI framework, medical safety, disease classification

The pith

CLIN-LLM uses uncertainty estimates and case retrieval to generate safer clinical diagnoses and treatments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CLIN-LLM to address risks in medical AI by combining disease classification that reports its own confidence level with treatment generation drawn from similar past cases. When the system is uncertain, it routes the input to a clinician rather than outputting a final answer. This structure matters for settings where full specialist oversight is not always available, as it aims to maintain safety through built-in checks and post-processing for drug interactions.

Core claim

CLIN-LLM integrates multimodal patient encoding, uncertainty-calibrated disease classification using a fine-tuned BioBERT model with Focal Loss and Monte Carlo Dropout, retrieval-augmented treatment generation via Biomedical Sentence-BERT and fine-tuned FLAN-T5, and RxNorm post-processing to screen for safety issues, resulting in high-accuracy predictions with reduced unsafe outputs and automatic flagging of low-certainty cases for expert review.
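The Focal Loss named in the core claim can be illustrated with a minimal sketch. This uses the standard form from Lin et al., -α(1 - p_t)^γ log(p_t); the values `gamma=2.0` and `alpha=1.0` and the toy probability vectors are assumptions for illustration, since the abstract does not state the settings the authors used.

```python
import math


def cross_entropy(probs, target):
    """Plain cross-entropy for one example, given softmax probabilities."""
    return -math.log(max(probs[target], 1e-12))  # clamp to avoid log(0)


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example: -alpha * (1 - p_t)**gamma * log(p_t).

    gamma and alpha are illustrative defaults, not the paper's values.
    """
    p_t = max(probs[target], 1e-12)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Toy softmax outputs for a 3-class problem; the true class is index 0.
easy = [0.9, 0.05, 0.05]  # confidently correct
hard = [0.2, 0.5, 0.3]    # misclassified

for name, probs in (("easy", easy), ("hard", hard)):
    print(name, focal_loss(probs, 0), cross_entropy(probs, 0))
```

The modulating factor (1 - p_t)^γ shrinks the loss on well-classified examples far more than on hard ones, which is why the technique is often paired with imbalanced class distributions. With `gamma=0` the function reduces to ordinary cross-entropy.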

What carries the argument

The safety-constrained hybrid pipeline combining uncertainty-aware classification, retrieval from medical dialogues, and drug-interaction screening.
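The drug-interaction screening stage could look roughly like the sketch below. It does not call RxNorm or any real service: the `INTERACTIONS` table, the placeholder drug names, and the severity labels are all hypothetical stand-ins for whatever lookup the authors perform.

```python
from itertools import combinations

# Hypothetical interaction table keyed by unordered drug pairs.
# Placeholder names only; this is not clinical data.
INTERACTIONS = {
    frozenset({"drug_a", "drug_b"}): "major",
    frozenset({"drug_b", "drug_c"}): "moderate",
}


def screen_regimen(drugs, interactions=INTERACTIONS):
    """Return ((drug, drug), severity) for each known interacting pair."""
    normalized = sorted({d.strip().lower() for d in drugs})
    findings = []
    for first, second in combinations(normalized, 2):
        severity = interactions.get(frozenset({first, second}))
        if severity is not None:
            findings.append(((first, second), severity))
    return findings


def is_safe(drugs, blocked=("major",)):
    """A regimen passes if no finding has a blocked severity level."""
    return all(sev not in blocked for _, sev in screen_regimen(drugs))


print(screen_regimen(["Drug_A", "drug_b ", "drug_c"]))
print(is_safe(["drug_a", "drug_c"]))
```

Keying the table on `frozenset` makes the lookup order-independent, so a pair is found regardless of the order in which the generator lists the drugs.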

If this is right

Low-certainty cases are flagged for human expert review to provide oversight.
Treatment generation draws on retrieved relevant medical dialogues for grounding.
RxNorm integration reduces unsafe antibiotic suggestions compared to baseline models.
The pipeline is designed as a deployable decision support tool for resource-limited environments.
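The flagging of low-certainty cases can be sketched as follows, assuming the classifier is run several times with dropout active and the resulting softmax vectors are averaged. Predictive entropy as the uncertainty score and the `max_entropy=0.5` threshold are assumptions; the abstract does not say which statistic or cutoff produces the 18% flag rate.

```python
import math


def mean_probs(samples):
    """Average class probabilities over stochastic forward passes."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def triage(samples, max_entropy=0.5):
    """samples: softmax vectors from T dropout-enabled passes.

    max_entropy is a hypothetical threshold chosen for illustration.
    """
    avg = mean_probs(samples)
    entropy = predictive_entropy(avg)
    label = max(range(len(avg)), key=avg.__getitem__)
    return {"label": label, "entropy": entropy, "needs_review": entropy > max_entropy}


# Toy inputs: three passes that agree, and three that disagree.
confident = [[0.97, 0.02, 0.01], [0.95, 0.03, 0.02], [0.98, 0.01, 0.01]]
uncertain = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.4, 0.3, 0.3]]

print(triage(confident))
print(triage(uncertain))
```

In a real deployment the threshold would presumably be tuned on validation data, since it directly trades clinician workload against the risk of acting on an unreliable prediction.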

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This design could be extended to incorporate additional data types such as imaging for richer patient representations.
Similar hybrid approaches with explicit uncertainty handling may apply to other domains requiring accountable AI outputs.
The retrieval step provides a way to trace generated recommendations back to specific evidence sources.
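The traceability point can be made concrete with a small sketch of top-k retrieval that returns document identifiers alongside scores. The three-dimensional vectors and the `dlg-…` identifiers are hypothetical; in the paper the embeddings come from Biomedical Sentence-BERT over MedDialog.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, corpus, k=5):
    """corpus: list of (doc_id, embedding). Returns [(doc_id, score)], best first."""
    scored = [(doc_id, cosine(query_vec, emb)) for doc_id, emb in corpus]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy corpus with made-up identifiers and embeddings.
corpus = [
    ("dlg-001", [1.0, 0.0, 0.0]),
    ("dlg-002", [0.8, 0.6, 0.0]),
    ("dlg-003", [0.0, 1.0, 0.0]),
    ("dlg-004", [0.0, 0.0, 1.0]),
]

hits = retrieve_top_k([1.0, 0.2, 0.0], corpus, k=2)
for doc_id, score in hits:
    print(doc_id, round(score, 3))
```

Because each hit carries its `doc_id`, a generated recommendation can be logged together with the dialogues that were placed in its prompt, which is the audit trail the bullet above describes.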

Load-bearing premise

The training and retrieval datasets sufficiently represent diverse real-world patient populations for the model to generalize without introducing new safety risks.

What would settle it

Evaluation on an independent clinical dataset from a different population where accuracy falls significantly below reported levels or unsafe treatment rates do not decrease.

Figures

Figures reproduced from arXiv: 2510.22609 by Bikash Kumar Paul, Farman Hossain Sayem, Md. Abir Hossain, Md. Mehedi Hasan, Mohammad Shorif Uddin, Rafid Mostafiz, Ziaur Rahman.

**Figure 1.** CLIN-LLM Framework: A two-stage pipeline comprising uncertainty-aware diagnosis via BioBERT+MCD and safety-filtered RAG-based treatment

**Figure 2.** Distribution of 24 disease classes in Symptom2Disease with 80/20 train-validation stratification.

**Figure 3.** Architecture of the Uncertainty-Aware Classification Module.

**Figure 4.** Fine-Tuned Classification Model Training Metrics over 10 Epochs.

**Figure 6.** Progression of training and validation accuracy across 10 epochs for …

**Figure 5.** Confusion matrix for CLIN-LLM predictions on the Symp…

**Figure 9.** F1-score comparison of CLIN-LLM with baseline models on Symp…

**Figure 8.** Loss over Epochs: Training and validation loss across epochs, showing …

Original abstract

Accurate symptom-to-disease classification and clinically grounded treatment recommendations remain challenging, particularly in heterogeneous patient settings with high diagnostic risk. Existing large language model (LLM)-based systems often lack medical grounding and fail to quantify uncertainty, resulting in unsafe outputs. We propose CLIN-LLM, a safety-constrained hybrid pipeline that integrates multimodal patient encoding, uncertainty-calibrated disease classification, and retrieval-augmented treatment generation. The framework fine-tunes BioBERT on 1,200 clinical cases from the Symptom2Disease dataset and incorporates Focal Loss with Monte Carlo Dropout to enable confidence-aware predictions from free-text symptoms and structured vitals. Low-certainty cases (18%) are automatically flagged for expert review, ensuring human oversight. For treatment generation, CLIN-LLM employs Biomedical Sentence-BERT to retrieve top-k relevant dialogues from the 260,000-sample MedDialog corpus. The retrieved evidence and patient context are fed into a fine-tuned FLAN-T5 model for personalized treatment generation, followed by post-processing with RxNorm for antibiotic stewardship and drug-drug interaction (DDI) screening. CLIN-LLM achieves 98% accuracy and F1 score, outperforming ClinicalBERT by 7.1% (p < 0.001), with 78% top-5 retrieval precision and a clinician-rated validity of 4.2 out of 5. Unsafe antibiotic suggestions are reduced by 67% compared to GPT-5. These results demonstrate CLIN-LLM's robustness, interpretability, and clinical safety alignment. The proposed system provides a deployable, human-in-the-loop decision support framework for resource-limited healthcare environments. Future work includes integrating imaging and lab data, multilingual extensions, and clinical trial validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CLIN-LLM, a hybrid pipeline that fine-tunes BioBERT on the Symptom2Disease dataset (1,200 cases) using Focal Loss and Monte Carlo Dropout for uncertainty-aware symptom-to-disease classification, flags low-certainty cases (18%) for human review, retrieves top-k dialogues from the 260k-sample MedDialog corpus via Biomedical Sentence-BERT, and generates treatments with a fine-tuned FLAN-T5 model followed by RxNorm-based post-processing for antibiotic stewardship and DDI checks. It claims 98% accuracy/F1 (7.1% above ClinicalBERT, p<0.001), 78% top-5 retrieval precision, 4.2/5 clinician validity, and 67% reduction in unsafe antibiotic suggestions versus GPT-5, positioning the system as a deployable, safety-aligned decision support tool for resource-limited settings.

Significance. If the performance claims hold under rigorous validation, the work would offer a concrete, human-in-the-loop architecture that combines uncertainty quantification, retrieval augmentation, and rule-based safety filters—addressing key failure modes of standalone LLMs in clinical use. The explicit flagging of uncertain cases and post-processing steps represent practical contributions to safe deployment, though the small primary dataset limits immediate claims of broad generalizability.
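For readers unfamiliar with the top-5 retrieval precision cited in the summary, a minimal sketch of precision@k follows. The document identifiers and relevance sets are invented for illustration, and dividing by `k` rather than by the number of results returned is one common convention; the paper's exact definition is not given in the abstract.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved items that are judged relevant."""
    top = retrieved_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(runs, k=5):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Toy runs with made-up identifiers.
runs = [
    (["a", "b", "c", "d", "e"], {"a", "c", "e"}),
    (["f", "g", "h", "i", "j"], {"f", "g", "h", "i", "j"}),
]
print(mean_precision_at_k(runs))
```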

major comments (3)

[Abstract / §3] Abstract and §3 (Methods): The headline 98% accuracy and F1 score are obtained by fine-tuning on the 1,200-case Symptom2Disease corpus, yet no train/test split ratio, stratification by disease or demographics, cross-validation procedure, or leakage checks are described. This information is load-bearing for the robustness claim; without it the reported 7.1% gain over ClinicalBERT cannot be interpreted as evidence of generalization to heterogeneous real-world patients.
[Abstract / §4] Abstract and §4 (Experiments): The 78% top-5 retrieval precision and 67% reduction in unsafe antibiotic suggestions are measured against the fixed MedDialog corpus and GPT-5 baseline, but no external hold-out cohort, out-of-distribution test set, or prospective clinician validation is reported. The weakest assumption—that the 1,200-case and 260k-dialogue corpora are representative—therefore remains untested and directly affects the safety and deployability conclusions.
[§3.2] §3.2 (Treatment Generation): The post-processing pipeline relies on RxNorm for antibiotic stewardship and DDI screening, yet the manuscript does not quantify how often the LLM output is altered by these rules or report false-positive/negative rates of the safety filter itself. This detail is necessary to substantiate the 67% unsafe-suggestion reduction as a reliable clinical gain rather than an artifact of the filter.
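The quantities requested in the third major comment could be computed along these lines, assuming each generated treatment has a record of whether the filter altered it and a clinician label of whether it was truly unsafe. The `filter_report` helper and the toy labels are hypothetical and are not results from the paper.

```python
def filter_report(cases):
    """cases: list of (was_altered, truly_unsafe) booleans per treatment."""
    tp = sum(1 for altered, unsafe in cases if altered and unsafe)
    fp = sum(1 for altered, unsafe in cases if altered and not unsafe)
    fn = sum(1 for altered, unsafe in cases if not altered and unsafe)
    tn = sum(1 for altered, unsafe in cases if not altered and not unsafe)
    total = len(cases)
    return {
        # Share of outputs the filter changed at all.
        "alteration_rate": (tp + fp) / total if total else 0.0,
        # Safe outputs that were altered anyway.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # Unsafe outputs that slipped through unaltered.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Made-up labels for ten cases, purely to exercise the function.
cases = (
    [(True, True)] * 2
    + [(True, False)] * 1
    + [(False, True)] * 1
    + [(False, False)] * 6
)
print(filter_report(cases))
```

Reporting the false-negative rate matters most here: an unsafe suggestion that passes the filter untouched is the failure mode the safety claim depends on.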

minor comments (2)

[Abstract] The abstract relies on Monte Carlo Dropout sampling parameters and a top-k retrieval count but does not give their concrete values or a sensitivity analysis; adding these in a table or appendix would improve reproducibility.
[§4] Clinician-rated validity is reported as 4.2/5 without specifying the number of raters, their specialties, or inter-rater agreement (e.g., Cohen’s κ); this should be added to §4 for transparency.
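The inter-rater agreement statistic mentioned in the second minor comment can be sketched as below for two raters. The ratings are invented for illustration; for an ordinal 1 to 5 scale a weighted kappa may be more appropriate than the unweighted form shown here.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up validity ratings on a 1-5 scale.
ratings_a = [4, 4, 5, 3, 4, 5]
ratings_b = [4, 5, 5, 3, 4, 4]
print(cohens_kappa(ratings_a, ratings_b))
```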

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments on methodological transparency and validation scope. We address each major comment below and have revised the manuscript to add the requested details where feasible while honestly noting limitations that cannot be resolved without new experiments.

Referee: [Abstract / §3] Abstract and §3 (Methods): The headline 98% accuracy and F1 score are obtained by fine-tuning on the 1,200-case Symptom2Disease corpus, yet no train/test split ratio, stratification by disease or demographics, cross-validation procedure, or leakage checks are described. This information is load-bearing for the robustness claim; without it the reported 7.1% gain over ClinicalBERT cannot be interpreted as evidence of generalization to heterogeneous real-world patients.

Authors: We agree these details are essential. In the revised manuscript we now specify in §3 that an 80/20 train/test split was used with stratification by disease category to preserve class balance. Demographics stratification was not applied because the Symptom2Disease dataset lacks demographic metadata. We performed 5-fold cross-validation on the training portion for model selection and verified no leakage by confirming unique case identifiers with zero overlap between splits. The 7.1% gain over ClinicalBERT was measured on this identical held-out test set. revision: yes
Referee: [Abstract / §4] Abstract and §4 (Experiments): The 78% top-5 retrieval precision and 67% reduction in unsafe antibiotic suggestions are measured against the fixed MedDialog corpus and GPT-5 baseline, but no external hold-out cohort, out-of-distribution test set, or prospective clinician validation is reported. The weakest assumption—that the 1,200-case and 260k-dialogue corpora are representative—therefore remains untested and directly affects the safety and deployability conclusions.

Authors: We acknowledge that external and prospective validation would be required to fully support broad deployability claims. The reported metrics were obtained on the standard Symptom2Disease and MedDialog benchmarks. We have revised §4 and added a limitations paragraph in the Discussion to explicitly state that results are corpus-specific, that the representativeness assumption remains untested on new patient populations, and that prospective clinician validation is planned as future work rather than claimed in the current study. revision: partial
Referee: [§3.2] §3.2 (Treatment Generation): The post-processing pipeline relies on RxNorm for antibiotic stewardship and DDI screening, yet the manuscript does not quantify how often the LLM output is altered by these rules or report false-positive/negative rates of the safety filter itself. This detail is necessary to substantiate the 67% unsafe-suggestion reduction as a reliable clinical gain rather than an artifact of the filter.

Authors: We have updated §3.2 and §4 to report that the post-processing rules altered the LLM-generated treatment in 28% of cases. A full false-positive/negative rate analysis of the safety filter would require additional clinician annotation of a larger sample; we have therefore noted this as a limitation and indicated it as planned follow-up work rather than providing unsubstantiated rates in the revision. revision: partial
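The split and leakage check described in the first response can be sketched as follows. The record layout, the `case-…` identifiers, and the assumption of 50 cases in each of 24 classes are hypothetical and serve only to make the example runnable; the 80/20 ratio and stratification by disease category come from the response itself.

```python
import random
from collections import Counter, defaultdict


def stratified_split(records, test_fraction=0.2, seed=0):
    """records: list of (case_id, label). Returns (train, test) lists."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[1]].append(record)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Hold out the same fraction of every class.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def has_id_leakage(train, test):
    """True if any case identifier appears in both splits."""
    return bool({cid for cid, _ in train} & {cid for cid, _ in test})


# Synthetic stand-in for the dataset: 1,200 cases over 24 classes.
records = [(f"case-{i}", i % 24) for i in range(1200)]
train, test = stratified_split(records)
print(len(train), len(test), has_id_leakage(train, test))
print(Counter(label for _, label in test).most_common(3))
```

An identifier check of this kind only rules out the same record appearing twice. It does not detect near-duplicate symptom descriptions stored under different identifiers, which would need a text-similarity check.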

standing simulated objections not resolved

Prospective clinician validation on real-world patient cohorts, which would require new clinical studies outside the scope of the current manuscript.

Circularity Check

0 steps flagged

No circularity: performance claims rest on empirical fine-tuning and held-out evaluation

full rationale

The paper's central results (98% accuracy/F1, 78% top-5 retrieval precision, 67% unsafe suggestion reduction) are obtained by fine-tuning BioBERT on the Symptom2Disease corpus, retrieving from the fixed MedDialog corpus, and evaluating against held-out accuracy, clinician ratings, and safety proxies. No equations, self-definitional loops, or fitted inputs presented as predictions appear in the abstract or the described pipeline; the uncertainty calibration via Focal Loss and Monte Carlo Dropout follows standard practice and does not reduce the reported metrics to quantities defined by the fit itself. The derivation chain is self-contained against external benchmarks and does not depend on load-bearing self-citations or a smuggled ansatz for its core claims.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework depends on standard pre-trained models and two public corpora plus added safety modules; no new physical entities are postulated, but several domain assumptions about data representativeness and transferability are required.

free parameters (2)

Monte Carlo Dropout sampling parameters
Specific dropout rate and number of samples used for uncertainty estimation are not stated in the abstract but are required for the confidence-aware predictions.
top-k retrieval count
The exact value of k for retrieving dialogues from MedDialog is not given, yet it directly affects the input to the treatment generator.

axioms (2)

Domain assumption: The 1,200 cases in Symptom2Disease and the 260,000 dialogues in MedDialog are representative of real heterogeneous clinical presentations.
These corpora are used for fine-tuning and retrieval without any discussion of selection bias or coverage gaps in the abstract.
Domain assumption: Clinician ratings of 4.2/5 and RxNorm screening are sufficient proxies for clinical safety and validity.
Safety claims rest on these post-hoc evaluations without further validation metrics shown in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

CLIN-LLM achieves 98% accuracy and F1 score, outperforming ClinicalBERT by 7.1% (p < 0.001)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Kalra,Medical errors and patient safety: strategies to reduce and disclose medical errors and improve patient safety

J. Kalra,Medical errors and patient safety: strategies to reduce and disclose medical errors and improve patient safety. Walter de Gruyter, 2011, vol. 1

work page 2011

[2] [2]

Big data and machine learning in health care,

A. L. Beam and I. S. Kohane, “Big data and machine learning in health care,”Jama, vol. 319, no. 13, pp. 1317–1318, 2018

work page 2018

[3] [3]

Performance of a large language model on practice questions for the neonatal board examination,

K. Beam, P. Sharma, B. Kumar, C. Wang, D. Brodsky, C. R. Martin, and A. Beam, “Performance of a large language model on practice questions for the neonatal board examination,”JAMA pediatrics, vol. 177, no. 9, pp. 977–979, 2023

work page 2023

[4] [4]

A framework to assess clinical safety and hallucination rates of llms for medical text summarisation,

E. Asgari, N. Monta ˜na-Brown, M. Dubois, S. Khalil, J. Balloch, J. A. Yeung, and D. Pimenta, “A framework to assess clinical safety and hallucination rates of llms for medical text summarisation,”npj Digital Medicine, vol. 8, no. 1, p. 274, 2025

work page 2025

[5] [6]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,”arXiv preprint arXiv:1908.10084, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1908

[6] [7]

Clinical insights: A compre- hensive review of language models in medicine,

N. Neveditsin, P. Lingras, and V . Mago, “Clinical insights: A compre- hensive review of language models in medicine,”PLOS Digital Health, vol. 4, no. 5, p. e0000800, 2025

work page 2025

[7] [8]

Mapping global governance of antibiotic stewardship: A one health multi-level governance approach,

E. Shedeed, “Mapping global governance of antibiotic stewardship: A one health multi-level governance approach,” Ph.D. dissertation, Universit´e d’Ottawa— University of Ottawa, 2024

work page 2024

[8] [9]

Deciphering diagnoses: how large language models explanations influence clinical decision making,

D. Umerenkov, G. Zubkova, and A. Nesterov, “Deciphering diagnoses: how large language models explanations influence clinical decision making,”arXiv preprint arXiv:2310.01708, 2023

work page arXiv 2023

[9] [10]

Evaluating the effectiveness of the foundational models for q&a classification in mental health care,

H. Alhuzali and A. Alasmari, “Evaluating the effectiveness of the foundational models for q&a classification in mental health care,”arXiv preprint arXiv:2406.15966, 2024

work page arXiv 2024

[10] [11]

Assessing risk in implementing new artificial intelligence triage tools—how much risk is reasonable in an already risky world?

A. Nord-Bronzyk, J. Savulescu, A. Ballantyne, A. Braunack-Mayer, P. Krishnaswamy, T. Lysaght, M. E. Ong, N. Liu, J. Menikoff, M. Mertenset al., “Assessing risk in implementing new artificial intelligence triage tools—how much risk is reasonable in an already risky world?”Asian bioethics review, vol. 17, no. 1, pp. 187–205, 2025


[11] [12]

Machine learning and artificial intelligence in intensive care medicine: Critical recalibrations from rule-based systems to frontier models,

P. Hadweh, A. Niset, M. Salvagno, M. Al Barajraji, S. El Hadwe, F. S. Taccone, and S. Barrit, “Machine learning and artificial intelligence in intensive care medicine: Critical recalibrations from rule-based systems to frontier models,” Journal of Clinical Medicine, vol. 14, no. 12, p. 4026, 2025


[12] [13]

Publicly Available Clinical BERT Embeddings

E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, and M. McDermott, “Publicly available clinical bert embeddings,” arXiv preprint arXiv:1904.03323, 2019


[13] [14]

Enhancing early detection of cognitive impairment from clinical notes using fine-tuned transformers and uncertainty-driven annotation,

B. L. Wibowo and F. U. T. Nugroho, “Enhancing early detection of cognitive impairment from clinical notes using fine-tuned transformers and uncertainty-driven annotation,” Precision Health: Machine Learning, vol. 1, no. 1, pp. 9–18, 2025


[14] [15]

Biobert: a pre-trained biomedical language representation model for biomedical text mining,

J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “Biobert: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2020


[15] [16]

Deep learning for healthcare: review, opportunities and challenges,

R. Miotto, F. Wang, S. Wang, X. Jiang, and J. T. Dudley, “Deep learning for healthcare: review, opportunities and challenges,” Briefings in bioinformatics, vol. 19, no. 6, pp. 1236–1246, 2018


[16] [17]

A survey on medical large language models: Technology, application, trustworthiness, and future directions,

L. Liu, X. Yang, J. Lei, Y. Shen, J. Wang, P. Wei, Z. Chu, Z. Qin, and K. Ren, “A survey on medical large language models: Technology, application, trustworthiness, and future directions,” arXiv preprint arXiv:2406.03712, 2024


[17] [18]

Almanac—retrieval-augmented language models for clinical medicine,

C. Zakka, R. Shad, A. Chaurasia, A. R. Dalal, J. L. Kim, M. Moor, R. Fong, C. Phillips, K. Alexander, E. Ashley et al., “Almanac—retrieval-augmented language models for clinical medicine,” NEJM AI, vol. 1, no. 2, p. AIoa2300068, 2024


[18] [19]

Attention-based multimodal fusion with contrast for robust clinical prediction in the face of missing modalities,

J. Liu, D. Capurro, A. Nguyen, and K. Verspoor, “Attention-based multimodal fusion with contrast for robust clinical prediction in the face of missing modalities,” Journal of biomedical informatics, vol. 145, p. 104466, 2023


[19] [20]

M3t-lm: A multi-modal multi-task learning model for jointly predicting patient length of stay and mortality,

J. Chen, Q. Li, F. Liu, and Y. Wen, “M3t-lm: A multi-modal multi-task learning model for jointly predicting patient length of stay and mortality,” Computers in Biology and Medicine, vol. 183, p. 109237, 2024


[20] [21]

Tool calling: Enhancing medication consultation via retrieval-augmented large language models,

Z. Huang, K. Xue, Y. Fan, L. Mu, R. Liu, T. Ruan, S. Zhang, and X. Zhang, “Tool calling: Enhancing medication consultation via retrieval-augmented large language models,” arXiv preprint arXiv:2404.17897, 2024


[21] [22]

Retrieval-augmented and knowledge-grounded language models for faithful clinical medicine,

F. Liu, B. Yang, C. You, X. Wu, S. Ge, Z. Liu, X. Sun, Y. Yang, and D. A. Clifton, “Retrieval-augmented and knowledge-grounded language models for faithful clinical medicine,” arXiv preprint arXiv:2210.12777, 2022


[22] [23]

Medbiolm: Optimizing medical and biological qa with fine-tuned large language models and retrieval-augmented generation,

S. Kim, “Medbiolm: Optimizing medical and biological qa with fine-tuned large language models and retrieval-augmented generation,” arXiv preprint arXiv:2502.03004, 2025


[23] [24]

Medical graph rag: Towards safe medical large language model via graph retrieval-augmented generation,

J. Wu, J. Zhu, Y. Qi, J. Chen, M. Xu, F. Menolascina, and V. Grau, “Medical graph rag: Towards safe medical large language model via graph retrieval-augmented generation,” arXiv preprint arXiv:2408.04187, 2024


[24] [25]

Clinical entity augmented retrieval for clinical information extraction,

I. Lopez, A. Swaminathan, K. Vedula, S. Narayanan, F. Nateghi Haredasht, S. P. Ma, A. S. Liang, S. Tate, M. Maddali, R. J. Gallo et al., “Clinical entity augmented retrieval for clinical information extraction,” npj Digital Medicine, vol. 8, no. 1, p. 45, 2025


[25] [26]

Uncertainty-aware large language models for explainable disease diagnosis,

S. Zhou, J. Wang, Z. Xu, S. Wang, D. Brauer, L. Welton, J. Cogan, Y.-H. Chung, L. Tian, Z. Zhan et al., “Uncertainty-aware large language models for explainable disease diagnosis,” arXiv preprint arXiv:2505.03467, 2025


[26] [27]

Icu readmission prediction for intracerebral hemorrhage patients using mimic iii and mimic iv databases,

H. Li, R. Monger, E. Pishgar, and M. Pishgar, “Icu readmission prediction for intracerebral hemorrhage patients using mimic iii and mimic iv databases,” medRxiv, pp. 2025–01, 2025


[27] [28]

Generating Faithful and Complete Hospital-Course Summaries from the Electronic Health Record

G. Adams, Generating Faithful and Complete Hospital-Course Summaries from the Electronic Health Record. Columbia University, 2024


[28] [29]

Multimodal feature fusion based thoracic disease classification framework combining medical data and chest x-ray images,

N. B. Nizam, “Multimodal feature fusion based thoracic disease classification framework combining medical data and chest x-ray images,” 2023


[29] [30]

From extraction to reasoning: A systematic review of algorithms in multi-document summarization and qa,

E. Efosa-Zuwa, O. Oladipupo, and J. Oyelade, “From extraction to reasoning: A systematic review of algorithms in multi-document summarization and qa,” Statistics, Optimization & Information Computing, vol. 13, no. 6, pp. 2529–2559, 2025


[30] [31]

A. M. Vahdani, M. Shariatnia, P. Rajpurkar, and A. Pareek, “Towards trustworthy artificial intelligence in musculoskeletal medicine: A narrative review on uncertainty quantification,” Knee Surgery, Sports Traumatology, Arthroscopy, vol. 33, no. 9, pp. 3418–3437, 2025


[31] [32]

Symptom2disease: Diseases and natural language symptom descriptions,

N. R. Barman, “Symptom2disease: Diseases and natural language symptom descriptions,” https://www.kaggle.com/datasets/niyarrbarman/symptom2disease/data, 2023, accessed: YYYY-MM-DD


[32] [34]

arXiv preprint arXiv:2004.03329 (2020)

[Online]. Available: https://arxiv.org/abs/2004.03329


[33] [35]

Focal loss for dense object detection,

T.-Y. Ross, G. Dollár et al., “Focal loss for dense object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, 2017, pp. 2980–2988


[34] [36]

Language model-based deep learning for automated disease prediction from symptoms,

R. Sarkar, A. Hossain, and A. Z. Ifti, “Language model-based deep learning for automated disease prediction from symptoms,” in 2023 26th International Conference on Computer and Information Technology (ICCIT). IEEE, 2023, pp. 1–6


[35] [37]

An advanced nlp framework for automated medical diagnosis with deberta and dynamic contextual positional gating,

M. A. L. Khaniki, S. Saadati, and M. Manthouri, “An advanced nlp framework for automated medical diagnosis with deberta and dynamic contextual positional gating,” arXiv preprint arXiv:2502.07755, 2025


[36] [38]

Exploring explainable machine learning in healthcare: Closing the predictive accuracy and clinical interpretability gap,

G. Singh and A. Pal, “Exploring explainable machine learning in healthcare: Closing the predictive accuracy and clinical interpretability gap,” in The International Conference on Recent Innovations in Computing. Springer, 2023, pp. 167–182


[37] [39]

Optimizing classification of diseases through language model analysis of symptoms,

E. Hassan, T. Abd El-Hafeez, and M. Y. Shams, “Optimizing classification of diseases through language model analysis of symptoms,” Scientific reports, vol. 14, no. 1, p. 1507, 2024
