Enhancing LLMs for Identifying and Prioritizing Important Medical Jargons from Electronic Health Record Notes Utilizing Data Augmentation

Hieu Tran; Hong Yu; Sharmin Sultana; Sunjae Kwon; Won Seok Jang; Zhichao Yang; Zonghai Yao

arxiv: 2502.16022 · v2 · submitted 2025-02-22 · 💻 cs.CL

Won Seok Jang, Sharmin Sultana, Zonghai Yao, Hieu Tran, Zhichao Yang, Sunjae Kwon, Hong Yu

Pith reviewed 2026-05-23 01:52 UTC · model grok-4.3

keywords: medical jargon extraction · LLM fine-tuning · data augmentation · EHR notes · term prioritization · OpenNotes · F1 score · mean reciprocal rank

The pith

Fine-tuning and data augmentation let open-source LLMs outperform closed-source models at extracting and ranking medical jargon in EHR notes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors test closed-source and open-source large language models on the task of finding and ranking the most important medical terms in electronic health record notes. They compare basic prompting, few-shot examples, structured instructions, fine-tuning the models, and using ChatGPT to create extra training data. The key result is that fine-tuning and data augmentation give the biggest gains, with open-source models that receive this treatment beating closed-source models on the ranking metric. This matters for making medical notes more readable for patients who access them through OpenNotes.

Core claim

Experiments on 106 expert-annotated EHR notes show that fine-tuning and data augmentation improve LLM performance for extracting and prioritizing medical jargon, with GPT-4 Turbo reaching the highest F1 score of 0.433 and Mistral7B with augmentation achieving the highest MRR of 0.746; open-source models enhanced this way surpass closed-source models. Few-shot prompting outperforms zero-shot in vanilla models, structured prompts yield different preferences across models, and fine-tuning improves zero-shot performance but sometimes degrades few-shot performance. Data augmentation performs comparably or better than other methods.

What carries the argument

Data augmentation generated by ChatGPT to expand training sets from 10 to 10,000 samples, paired with fine-tuning and ranking techniques evaluated via 5-fold cross-validation on F1 score and mean reciprocal rank.
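F1 scores the extracted set of terms, while MRR scores the position of the first correct term, so a model can lead on one metric and trail on the other. The sketch below shows one way the two could be computed for a single note. It assumes case-insensitive exact matching between predicted and gold terms and a reciprocal rank based on the first predicted term found in the gold set; the paper's exact matching rules are not stated here, and the function names and sample terms are illustrative only.

```python
def _normalize(terms):
    """Lowercase and strip terms so matching is case-insensitive (an assumption)."""
    return [t.lower().strip() for t in terms]


def f1_score(predicted, gold):
    """Set-based F1 between predicted and gold terms for one note."""
    pred, ref = set(_normalize(predicted)), set(_normalize(gold))
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def reciprocal_rank(ranked_predictions, gold):
    """1 / rank of the first predicted term that appears in the gold set, else 0."""
    ref = set(_normalize(gold))
    for rank, term in enumerate(_normalize(ranked_predictions), start=1):
        if term in ref:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(all_ranked, all_gold):
    """Average the per-note reciprocal ranks over a collection of notes."""
    scores = [reciprocal_rank(p, g) for p, g in zip(all_ranked, all_gold)]
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative note: predictions are ranked, most important first.
predicted = ["metformin", "type 2 diabetes", "hypertension"]
gold = ["type 2 diabetes", "neuropathy"]
note_f1 = f1_score(predicted, gold)
note_rr = reciprocal_rank(predicted, gold)
```

In this toy note one of three predictions is correct, giving an F1 of about 0.4, and the first correct term sits at rank 2, giving a reciprocal rank of 0.5.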

Load-bearing premise

The 106 expert-annotated EHR notes form a sufficient and representative sample for measuring how well the models identify and prioritize medical jargon.

What would settle it

Re-evaluating the same models and methods on a new collection of several hundred EHR notes drawn from different hospitals or regions and finding that the reported performance ordering of strategies reverses or the absolute scores fall below those of plain prompting.

Figures

Figures reproduced from arXiv: 2502.16022 by Hieu Tran, Hong Yu, Sharmin Sultana, Sunjae Kwon, Won Seok Jang, Zhichao Yang, Zonghai Yao.

**Figure 2.** A sample EHR note where physicians identified important medical terms. Diagnoses/conditions are high …

**Figure 3.** Case Study for Extracting the Top 3 Important Medical Jargons from Zero-shot and Few-shot Prompts in …

**Figure 4.** Case Study for extracting Top 5 important medical jargons from BioMistral7B and BioMistral7B that was …

**Figure 5.** Case Study for extracting Top 5 important medical jargons from Llama3.1 8B finetuned and Llama 3.1 …

**Figure 6.** Structured Prompt. ### Instruction: You are a helpful assistant, an expert in medical domain. Extract top 3 key terms mentioned in the medical note that are important for the patient. If you think they are of same importance, they can have the same ranking. Do not write no symptoms, or any indication that there is no other diagnosis/symptoms or conditions. Do not modify or abbreviate what is written in the …

**Figure 7.** General Prompt

**Figure 8.** Prompt used for querying GPT-3.5 Turbo for data augmentation

Original abstract

OpenNotes enables patients to access EHR notes, but medical jargon can hinder comprehension. To improve understanding, we evaluated closed- and open-source LLMs for extracting and prioritizing key medical terms using prompting, fine-tuning, and data augmentation. We assessed LLMs on 106 expert-annotated EHR notes, experimenting with (i) general vs. structured prompts, (ii) zero-shot vs. few-shot prompting, (iii) fine-tuning, and (iv) data augmentation. To enhance open-source models in low-resource settings, we used ChatGPT for data augmentation and applied ranking techniques. We incrementally increased the augmented dataset size (10 to 10,000) and conducted 5-fold cross-validation, reporting F1 score and Mean Reciprocal Rank (MRR). Our result show that fine-tuning and data augmentation improved performance over other strategies. GPT-4 Turbo achieved the highest F1 (0.433), while Mistral7B with data augmentation had the highest MRR (0.746). Open-source models, when fine-tuned or augmented, outperformed closed-source models. Notably, the best F1 and MRR scores did not always align. Few-shot prompting outperformed zero-shot in vanilla models, and structured prompts yielded different preferences across models. Fine-tuning improved zero-shot performance but sometimes degraded few-shot performance. Data augmentation performed comparably or better than other methods. Our evaluation highlights the effectiveness of prompting, fine-tuning, and data augmentation in improving model performance for medical jargon extraction in low-resource scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Fine-tuning and ChatGPT augmentation improve F1 and MRR on medical jargon extraction from EHR notes, but the results sit on only 106 annotated examples with no sampling or agreement details.

The paper tests prompting variants, fine-tuning, and scaling synthetic data up to 10k examples on the task of pulling and ranking important medical terms from patient notes. GPT-4 Turbo hits the best F1 at 0.433 while an augmented Mistral-7B reaches the best MRR at 0.746, and open-source models pull ahead after the interventions. They run 5-fold cross-validation and track both extraction and ranking metrics, which is a straightforward way to compare the approaches on the same data.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates closed- and open-source LLMs for extracting and prioritizing medical jargon from EHR notes. Experiments on 106 expert-annotated notes compare general vs. structured prompts, zero- vs. few-shot prompting, fine-tuning, and ChatGPT-based data augmentation (scaled from 10 to 10k examples). 5-fold cross-validation is used to report F1 and MRR; the authors conclude that fine-tuning and augmentation improve results over prompting alone, GPT-4 Turbo reaches the highest F1 (0.433), Mistral-7B with augmentation reaches the highest MRR (0.746), and augmented open-source models can outperform closed-source ones.

Significance. If the performance gains hold on larger, more diverse EHR corpora, the work could provide practical guidance for low-resource medical-jargon extraction pipelines that combine prompting, fine-tuning, and synthetic data, potentially aiding patient comprehension of OpenNotes.

major comments (3)

[Abstract / Methods] The entire evaluation rests on 5-fold CV over only 106 notes (Abstract and Methods). No information is given on sampling procedure, fraction of the source EHR corpus represented, medical sub-domain coverage, or inter-annotator agreement; because all augmented data (up to 10k examples) is generated from this same seed, any selection bias is amplified rather than mitigated.
[Abstract / Results] No baseline systems (rule-based term extractors, standard biomedical NER models, or simpler ranking methods) are reported, nor are statistical significance tests or confidence intervals provided for the F1/MRR differences (Abstract and Results). This makes it impossible to determine whether the claimed improvements over prompting are reliable.
[Results] The claim that data augmentation “performed comparably or better” and that open-source models “outperformed closed-source models” when augmented is load-bearing for the paper’s contribution, yet the evaluation provides no external validation set or out-of-distribution test to separate modeling effects from idiosyncrasies of the 106-note distribution.

minor comments (2)

[Abstract] Typo: “Our result show” should read “Our results show.”
[Abstract] The abstract states that “fine-tuning improved zero-shot performance but sometimes degraded few-shot performance,” yet no quantitative breakdown or table is referenced to support this observation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment point by point below, indicating where revisions to the manuscript are planned.

Referee: [Abstract / Methods] The entire evaluation rests on 5-fold CV over only 106 notes (Abstract and Methods). No information is given on sampling procedure, fraction of the source EHR corpus represented, medical sub-domain coverage, or inter-annotator agreement; because all augmented data (up to 10k examples) is generated from this same seed, any selection bias is amplified rather than mitigated.

Authors: We agree that the manuscript would benefit from additional dataset details. The 106 notes were randomly sampled from EHR notes at a single academic medical center and cover multiple clinical sub-domains, but the exact fraction of the source corpus and inter-annotator agreement statistics were not reported. In the revised version we will add a dedicated dataset subsection describing the sampling procedure, corpus fraction (where known), sub-domain coverage, and any available inter-annotator agreement figures. We will also expand the discussion to acknowledge that augmentation from the same seed can amplify selection bias and note this as a limitation of the current low-resource setting.

Revision: partial

Referee: [Abstract / Results] No baseline systems (rule-based term extractors, standard biomedical NER models, or simpler ranking methods) are reported, nor are statistical significance tests or confidence intervals provided for the F1/MRR differences (Abstract and Results). This makes it impossible to determine whether the claimed improvements over prompting are reliable.

Authors: This observation is correct. The revised manuscript will include comparisons against at least two baselines: a rule-based medical term extractor using UMLS and a standard biomedical NER model (BioBERT). We will also add statistical significance testing (paired t-tests across the five folds) and 95% confidence intervals for all reported F1 and MRR differences to allow readers to assess the reliability of the observed gains.

Revision: yes

Referee: [Results] The claim that data augmentation “performed comparably or better” and that open-source models “outperformed closed-source models” when augmented is load-bearing for the paper’s contribution, yet the evaluation provides no external validation set or out-of-distribution test to separate modeling effects from idiosyncrasies of the 106-note distribution.

Authors: We acknowledge that the lack of an external or out-of-distribution test set is a genuine limitation. The study was designed around a low-resource scenario with only 106 expert-annotated notes; an external validation set was not available. The 5-fold cross-validation therefore serves as the primary internal evaluation. In the revised discussion we will explicitly qualify the claims by stating that the reported improvements are observed within this distribution and recommend future validation on larger, multi-institutional corpora. No new external data will be added at this time.

Revision: partial
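
The fold-level significance testing promised in the second response could look like the sketch below: a paired t-test on per-fold score differences, with a 95% confidence interval for the mean difference. The fold scores are invented for illustration and are not results from the paper, and the function name is hypothetical. The constant 2.776 is the two-sided 95% critical value of Student's t with 4 degrees of freedom, which is what five folds give.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (5 folds).
T_CRIT_DF4 = 2.776


def paired_fold_comparison(scores_a, scores_b, t_crit=T_CRIT_DF4):
    """Paired t-test and 95% CI on per-fold differences between two systems."""
    if len(scores_a) != 5 or len(scores_b) != 5:
        raise ValueError("this sketch hard-codes the critical value for five folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    t_stat = mean_diff / std_err
    margin = t_crit * std_err
    return {
        "mean_diff": mean_diff,
        "t_stat": t_stat,
        "ci95": (mean_diff - margin, mean_diff + margin),
        "significant": abs(t_stat) > t_crit,
    }


# Invented per-fold MRR values for two hypothetical systems (not from the paper).
augmented = [0.70, 0.72, 0.68, 0.74, 0.71]
baseline = [0.65, 0.66, 0.66, 0.69, 0.64]
result = paired_fold_comparison(augmented, baseline)
```

Because the five folds share most of their training data, the per-fold differences are not fully independent, so a test like this is best read as approximate.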

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with standard CV

The paper reports results from an empirical evaluation of LLMs on a fixed set of 106 expert-annotated EHR notes. It compares prompting strategies, fine-tuning, and data augmentation (generated via ChatGPT from the seed notes) using 5-fold cross-validation to compute F1 and MRR. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential claims exist. All reported metrics are computed directly from model outputs on held-out folds; the augmentation process does not create a definitional loop because test performance is measured on unseen notes. No load-bearing self-citations or uniqueness theorems are invoked. The central claims rest on observable experimental outcomes rather than reducing to inputs by construction.
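
One way to keep held-out folds unseen when synthetic data is generated from the seed notes is to build each training set only from notes outside the test fold. The sketch below assumes every synthetic example can be traced back to the seed note it was generated from; that mapping, the function names, and the toy data are hypothetical and are not taken from the paper.

```python
import random


def k_fold_splits(note_ids, k=5, seed=0):
    """Yield (train_ids, test_ids) pairs; every note is held out exactly once."""
    ids = list(note_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i, test_ids in enumerate(folds):
        train_ids = [n for j, fold in enumerate(folds) if j != i for n in fold]
        yield train_ids, test_ids


def training_pool(train_ids, synthetic_by_seed, max_synthetic):
    """Collect synthetic examples whose seed note lies in the training fold only."""
    pool = []
    for note_id in train_ids:
        pool.extend(synthetic_by_seed.get(note_id, []))
    return pool[:max_synthetic]


# Toy data: 106 note ids, each with three placeholder synthetic examples
# stored as (seed_note_id, example_index) tuples.
note_ids = list(range(106))
synthetic_by_seed = {n: [(n, i) for i in range(3)] for n in note_ids}

for train_ids, test_ids in k_fold_splits(note_ids):
    pool = training_pool(train_ids, synthetic_by_seed, max_synthetic=100)
    # No synthetic example in the pool originates from a held-out note.
    assert not {seed_id for seed_id, _ in pool} & set(test_ids)
```

Raising `max_synthetic` step by step would mimic growing the augmented set, while the fold boundary stays fixed by the seed note rather than by the synthetic example.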

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an applied empirical study whose central claims rest on the quality of expert annotations and the assumption that automatic metrics reflect real patient utility.

axioms (1)

Domain assumption: Expert annotations on the 106 EHR notes constitute reliable ground truth for what counts as important medical jargon.
All reported F1 and MRR scores are computed directly against these annotations.



work page 2007

[18] [18]

Polepalli Ramesh, B., Houston, T., Brandt, C., Fang, H. & Yu, H. Improving patients’ electronic health record comprehension with noteaid. In MEDINFO 2013, 714–718 (IOS Press, 2013)

work page 2013

[19] [19]

Sarzynski, E. et al. Opportunities to improve clinical summaries for patients at hospital discharge. BMJ quality & safety 26, 372–380 (2017)

work page 2017

[20] [20]

C., Doak, L

Doak, C. C., Doak, L. G. & Root, J. H. Teaching patients with low literacy skills. AJN The American Journal of Nursing 96, 16M (1996)

work page 1996

[21] [21]

C., Doak, L

Doak, C. C., Doak, L. G., Friedell, G. H. & Meade, C. D. Improving comprehension for cancer patients with low literacy skills: strategies for clinicians. CA: A Cancer Journal for Clinicians 48, 151–162 (1998)

work page 1998

[22] [22]

Walsh, T. M. & V olsko, T. A. Readability assessment of internet-based consumer health information.Respiratory care 53, 1310–1315 (2008)

work page 2008

[23] [23]

E., Han, A., Truntzer, J

Eltorai, A. E., Han, A., Truntzer, J. & Daniels, A. H. Readability of patient education materials on the american orthopaedic society for sports medicine website. The Physician and Sportsmedicine 42, 125–130 (2014)

work page 2014

[24] [24]

J., Jansen, J

Morony, S., Flynn, M., McCaffery, K. J., Jansen, J. & Webster, A. C. Readability of written materials for ckd patients: a systematic review. American Journal of Kidney Diseases 65, 842–850 (2015)

work page 2015

[25] [25]

B., Farach, F

Johnson, S. B., Farach, F. J., Pelphrey, K. & Rozenblit, L. Data management in clinical research: synthesizing stakeholder perspectives. Journal of biomedical informatics 60, 286–293 (2016)

work page 2016

[26] [26]

A., Fiszman, M., Raja, K., Jonnalagadda, S

Morid, M. A., Fiszman, M., Raja, K., Jonnalagadda, S. R. & Del Fiol, G. Classification of clinically useful sentences in clinical evidence resources. Journal of biomedical informatics 60, 14–22 (2016)

work page 2016

[27] [27]

& Zeng-Treitler, Q

Kandula, S., Curtis, D. & Zeng-Treitler, Q. A semantic and syntactic text simplification tool for health content. In AMIA annual symposium proceedings, vol. 2010, 366 (American Medical Informatics Association, 2010)

work page 2010

[28] [28]

& Rosendale, D

Zeng-Treitler, Q., Goryachev, S., Kim, H., Keselman, A. & Rosendale, D. Making texts in electronic health records comprehensible to consumers: a prototype translator. In AMIA Annual Symposium Proceedings , vol. 2007, 846 (American Medical Informatics Association, 2007)

work page 2007

[29] [29]

& Kvist, M

Abrahamsson, E., Forni, T., Skeppstedt, M. & Kvist, M. Medical text simplification using synonym replacement: Adapting assessment of word difficulty to a compounding language. In Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), 57–65 (2014)

work page 2014

[30] [30]

Zheng, J. & Yu, H. Methods for linking ehr notes to education materials. Information Retrieval Journal 19, 174–188 (2016)

work page 2016

[31] [31]

Chen, J. et al. A natural language processing system that links medical terms in electronic health record notes to lay definitions: system development using physician reviews. Journal of medical Internet research 20, e26 (2018)

work page 2018

[32] [32]

Kwon, S. et al. Medjex: A medical jargon extraction model with wiki’s hyperlink span and contextualized masked language model score. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, vol. 2022, 11733 (NIH Public Access, 2022)

work page 2022

[33] [33]

E., Mouradi, O., Kauchak, D

Leroy, G., Endicott, J. E., Mouradi, O., Kauchak, D. & Just, M. L. Improving perceived and actual text difficulty for health information consumers using semi-automated methods. In AMIA Annual Symposium Proceedings , vol. 2012, 522 (American Medical Informatics Association, 2012)

work page 2012

[34] [34]

Chen, J., Zheng, J. & Yu, H. Finding Important Terms for Patients in Their Electronic Health Records: A Learning-to-Rank Approach Using Expert Annotations 4, e6373. URL https://medinform.jmir.org/2016/4/e40

work page 2016

[35] [35]

Chen, J. & Yu, H. Unsupervised ensemble ranking of terms in electronic health record notes based on their im- portance to patients 68, 121–131. URL https://www.sciencedirect.com/science/article/pii/S153204641730045X

work page

[36] [36]

Aronson, A. R. Metamap: Mapping text to the umls metathesaurus. Bethesda, MD: NLM, NIH, DHHS 1, 26 (2006)

work page 2006

[37] [37]

& Ammar, W

Neumann, M., King, D., Beltagy, I. & Ammar, W. Scispacy: fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019)

work page arXiv 1902

[38] [38]

Eyre, H. et al. Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python. AMIA Annu Symp Proc 2021, 438–447 (2021)

work page 2021

[39] [39]

& Goharian, N

Soldaini, L. & Goharian, N. Quickumls: a fast, unsupervised approach for medical concept extraction. In MedIR workshop, sigir, 1–4 (2016)

work page 2016

[40] [40]

Unified Medical Language System® (UMLS®) – Basics

work page

[41] [41]

Tian, S. et al. Opportunities and challenges for chatgpt and large language models in biomedicine and health. Briefings in Bioinformatics 25, bbad493 (2024)

work page 2024

[42] [42]

Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023)

work page 2023

[43] [43]

Singhal, K. et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617 (2023)

work page internal anchor Pith review arXiv 2023

[44] [44]

Tu, T. et al. Towards conversational diagnostic ai. arXiv preprint arXiv:2401.05654 (2024)

work page arXiv 2024

[45] [45]

McDuff, D. et al. Towards accurate differential diagnosis with large language models. arXiv preprint arXiv:2312.00164 (2023)

work page arXiv 2023

[46] [46]

Wu, C. et al. Pmc-llama: toward building open-source language models for medicine. Journal of the American Medical Informatics Association ocae045 (2024)

work page 2024

[47] [47]

Chen, Z. et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[48] [48]

Tran, H., Yang, Z., Yao, Z. & Yu, H. Bioinstruct: Instruction tuning of large language models for biomedical natural language processing. arXiv preprint arXiv:2310.19975 (2023)

work page arXiv 2023

[49] [49]

Capabilities of GPT-4 on Medical Challenge Problems

Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[50] [50]

Kung, T. H. et al. Performance of chatgpt on usmle: potential for ai-assisted medical education using large language models. PLoS digital health 2, e0000198 (2023)

work page 2023

[51] [51]

Yang, L. et al. Advancing multimodal medical capabilities of gemini. arXiv preprint arXiv:2405.03162 (2024)

work page arXiv 2024

[52] [52]

Yang, Z. et al. Performance of multimodal gpt-4v on usmle with image: potential for imaging diagnostic support with explanations. medRxiv 2023–10 (2023)

work page 2023

[53] [53]

Yao, Z. et al. Medqa-cs: Benchmarking large language models clinical skills using an ai-sce framework. arXiv preprint arXiv:2410.01553 (2024)

work page arXiv 2024

[54] [54]

Hu, Y . et al. Improving large language models for clinical named entity recognition via prompt engineering. Journal of the American Medical Informatics Association ocad259 (2024)

work page 2024

[55] [55]

Monajatipoor, M. et al. LLMs in Biomedicine: A study on clinical Named Entity Recognition. URL http: //arxiv.org/abs/2404.07376. 2404.07376

work page arXiv

[56] [56]

Hu, D., Liu, B., Zhu, X., Lu, X. & Wu, N. Zero-shot information extraction from radiological reports using chatgpt. International Journal of Medical Informatics 183, 105321 (2024)

work page 2024

[57] [57]

Liu, S., Wang, A., Xiu, X., Zhong, M. & Wu, S. Evaluating Medical Entity Recognition in Health Care: Entity Model Quantitative Study 12, e59782. URL https://medinform.jmir.org/2024/1/e59782

work page 2024

[58] [58]

Bose, P. et al. A Survey on Recent Named Entity Recognition and Relationship Extraction Techniques on Clinical Texts 11, 8319. URL https://www.mdpi.com/2076-3417/11/18/8319

work page 2076

[59] [59]

Lee, J. et al. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020)

work page 2020

[60] [60]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Liu, Y . Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1907

[61] [61]

Yao, Z., Cao, Y ., Yang, Z., Deshpande, V . & Yu, H. Extracting biomedical factual knowledge using pretrained language model and electronic health record context. In AMIA Annual Symposium Proceedings, vol. 2022, 1188 (2023)

work page 2022

[62] [62]

Yao, Z., Cao, Y ., Yang, Z. & Yu, H. Context variance evaluation of pretrained language models for prompt-based biomedical knowledge probing. AMIA Summits on Translational Science Proceedings 2023, 592 (2023)

work page 2023

[63] [63]

Gutierrez, B. J. et al. Thinking about gpt-3 in-context learning for biomedical ie? think again. arXiv preprint arXiv:2203.08410 (2022)

work page arXiv 2022

[64] [64]

& Samwald, M

Moradi, M., Blagec, K., Haberl, F. & Samwald, M. Gpt-3 models are poor few-shot learners in the biomedical domain. arXiv preprint arXiv:2109.02555 (2021)

work page arXiv 2021

[65] [65]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[66] [66]

Alsentzer, E. et al. Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1904

[67] [67]

H., Kwon, S., Yao, Z., Lalor, J

Lim, J. H., Kwon, S., Yao, Z., Lalor, J. P. & Yu, H. Large language model-based role-playing for personalized medical jargon extraction. arXiv preprint arXiv:2408.05555 (2024)

work page arXiv 2024

[68] [68]

Openai website

OpenAI. Openai website. URL https://openai.com/

work page

[69] [69]

Ghali, M.-K. et al. Gamedx: Generative ai-based medical entity data extractor using large language models. arXiv preprint arXiv:2405.20585 (2024)

work page arXiv 2024

[70] [70]

Butler, J. J. et al. From jargon to clarity: Improving the readability of foot and ankle radiology reports with an artificial intelligence large language model. Foot and Ankle Surgery30, 331–337 (2024)

work page 2024

[71] [71]

Mannhardt, N. et al. Impact of large language model assistance on patients reading clinical notes: A mixed- methods study. arXiv preprint arXiv:2401.09637 (2024)

work page arXiv 2024

[72] [72]

C., He, Y

Lu, J., Li, J., Wallace, B. C., He, Y . & Pergola, G. Napss: Paragraph-level medical text simplification via narrative prompting and sentence-matching summarization. arXiv preprint arXiv:2302.05574 (2023)

work page arXiv 2023

[73] [73]

Speier, W., Ong, M. K. & Arnold, C. W. Using phrases and document metadata to improve topic modeling of clinical reports 61, 260–266. URL https://www.sciencedirect.com/science/article/pii/S1532046416300284

work page

[74] [74]

Wen, Z. et al. Mining heterogeneous clinical notes by multi-modal latent topic model 16, e0249622 (2021. 4. 8.). URL https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0249622

work page doi:10.1371/journal.pone.0249622 2021

[75] [75]

Sun, S., Zack, T., Williams, C. Y . K., Sushil, M. & Butte, A. J. Topic modeling on clinical social work notes for exploring social determinants of health factors7, ooad112. URL https://www.ncbi.nlm.nih.gov/pmc/articles/ PMC10788143/. 38223407

work page

[76] [76]

N., Fodeh, S

Chen, J., Jagannatha, A. N., Fodeh, S. J. & Yu, H. Ranking Medical Terms to Support Expansion of Lay Lan- guage Resources for Patient Comprehension of Electronic Health Record Notes: Adapted Distant Supervision Approach 5, e42. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5686421/. 29089288

work page

[77] [77]

Aronson, A. R. & Lang, F.-M. An overview of metamap: historical perspective and recent advances. Journal of the American Medical Informatics Association 17, 229–236 (2010)

work page 2010

[78] [78]

Yao, Z. et al. Readme: Bridging medical jargon and lay understanding for patient education through data-centric nlp. arXiv preprint arXiv:2312.15561 (2023)

work page arXiv 2023

[79] [79]

Cai, P. et al. Generation of patient after-visit summaries to support physicians. In Proceedings of the 29th International Conference on Computational Linguistics (COLING) (2022)

work page 2022

[80] [80]

Jiang, A. Q. et al. Mistral 7B. URL http://arxiv.org/abs/2310.06825. 2310.06825

work page internal anchor Pith review Pith/arXiv arXiv