Enhancing LLMs for Identifying and Prioritizing Important Medical Jargons from Electronic Health Record Notes Utilizing Data Augmentation
Pith reviewed 2026-05-23 01:52 UTC · model grok-4.3
The pith
Fine-tuning and data augmentation let open-source LLMs outperform closed-source models at extracting and ranking medical jargon in EHR notes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments on 106 expert-annotated EHR notes show that fine-tuning and data augmentation improve LLM performance for extracting and prioritizing medical jargon, with GPT-4 Turbo reaching the highest F1 score of 0.433 and Mistral7B with augmentation achieving the highest MRR of 0.746; open-source models enhanced this way surpass closed-source models. Few-shot prompting outperforms zero-shot in vanilla models, structured prompts yield different preferences across models, and fine-tuning improves zero-shot performance but sometimes degrades few-shot performance. Data augmentation performs comparably or better than other methods.
What carries the argument
ChatGPT-generated augmentation data, scaled incrementally from 10 to 10,000 samples, paired with fine-tuning and ranking techniques and evaluated via 5-fold cross-validation on F1 score and mean reciprocal rank (MRR).
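The two reported metrics can be illustrated with a minimal sketch. The paper's exact matching rule is not stated here, so this assumes case-insensitive exact string matching between predicted and gold terms; the function names `f1_score`, `reciprocal_rank`, and `mean_reciprocal_rank` are hypothetical and not taken from the paper.

```python
from typing import Iterable, Sequence


def _normalize(terms: Iterable[str]) -> set:
    """Lowercase and strip terms (assumed matching rule, not the paper's)."""
    return {t.strip().lower() for t in terms}


def f1_score(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Set-based F1 between predicted and gold jargon terms for one note."""
    pred_set, gold_set = _normalize(predicted), _normalize(gold)
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def reciprocal_rank(ranked: Sequence[str], gold: Iterable[str]) -> float:
    """1 / rank of the first gold term in a ranked list, or 0.0 if none appears."""
    gold_set = _normalize(gold)
    for rank, term in enumerate(ranked, start=1):
        if term.strip().lower() in gold_set:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(ranked_lists, gold_lists) -> float:
    """Average reciprocal rank over notes; 0.0 for an empty collection."""
    pairs = list(zip(ranked_lists, gold_lists))
    if not pairs:
        return 0.0
    return sum(reciprocal_rank(r, g) for r, g in pairs) / len(pairs)
```

Because F1 scores the whole extracted set while MRR only looks at where the first correct term lands, a model can lead on one metric and trail on the other, which is consistent with the abstract's remark that the best F1 and MRR scores did not always align.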
Load-bearing premise
The 106 expert-annotated EHR notes form a sufficient and representative sample for measuring how well the models identify and prioritize medical jargon.
What would settle it
Re-evaluating the same models and methods on a new collection of several hundred EHR notes drawn from different hospitals or regions and finding that the reported performance ordering of strategies reverses or the absolute scores fall below those of plain prompting.
Original abstract
OpenNotes enables patients to access EHR notes, but medical jargon can hinder comprehension. To improve understanding, we evaluated closed- and open-source LLMs for extracting and prioritizing key medical terms using prompting, fine-tuning, and data augmentation. We assessed LLMs on 106 expert-annotated EHR notes, experimenting with (i) general vs. structured prompts, (ii) zero-shot vs. few-shot prompting, (iii) fine-tuning, and (iv) data augmentation. To enhance open-source models in low-resource settings, we used ChatGPT for data augmentation and applied ranking techniques. We incrementally increased the augmented dataset size (10 to 10,000) and conducted 5-fold cross-validation, reporting F1 score and Mean Reciprocal Rank (MRR). Our result show that fine-tuning and data augmentation improved performance over other strategies. GPT-4 Turbo achieved the highest F1 (0.433), while Mistral7B with data augmentation had the highest MRR (0.746). Open-source models, when fine-tuned or augmented, outperformed closed-source models. Notably, the best F1 and MRR scores did not always align. Few-shot prompting outperformed zero-shot in vanilla models, and structured prompts yielded different preferences across models. Fine-tuning improved zero-shot performance but sometimes degraded few-shot performance. Data augmentation performed comparably or better than other methods. Our evaluation highlights the effectiveness of prompting, fine-tuning, and data augmentation in improving model performance for medical jargon extraction in low-resource scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates closed- and open-source LLMs for extracting and prioritizing medical jargon from EHR notes. Experiments on 106 expert-annotated notes compare general vs. structured prompts, zero- vs. few-shot prompting, fine-tuning, and ChatGPT-based data augmentation (scaled from 10 to 10,000 examples). Five-fold cross-validation is used to report F1 and MRR. The authors conclude that fine-tuning and augmentation improve results over prompting alone, that GPT-4 Turbo reaches the highest F1 (0.433), that Mistral-7B with augmentation reaches the highest MRR (0.746), and that augmented open-source models can outperform closed-source ones.
Significance. If the performance gains hold on larger, more diverse EHR corpora, the work could provide practical guidance for low-resource medical-jargon extraction pipelines that combine prompting, fine-tuning, and synthetic data, potentially aiding patient comprehension of OpenNotes.
major comments (3)
- [Abstract / Methods] The entire evaluation rests on 5-fold CV over only 106 notes (Abstract and Methods). No information is given on sampling procedure, fraction of the source EHR corpus represented, medical sub-domain coverage, or inter-annotator agreement; because all augmented data (up to 10k examples) is generated from this same seed, any selection bias is amplified rather than mitigated.
- [Abstract / Results] No baseline systems (rule-based term extractors, standard biomedical NER models, or simpler ranking methods) are reported, nor are statistical significance tests or confidence intervals provided for the F1/MRR differences (Abstract and Results). This makes it impossible to determine whether the claimed improvements over prompting are reliable.
- [Results] The claim that data augmentation “performed comparably or better” and that open-source models “outperformed closed-source models” when augmented is load-bearing for the paper’s contribution, yet the evaluation provides no external validation set or out-of-distribution test to separate modeling effects from idiosyncrasies of the 106-note distribution.
minor comments (2)
- [Abstract] Typo: “Our result show” should read “Our results show.”
- [Abstract] The abstract states that “fine-tuning improved zero-shot performance but sometimes degraded few-shot performance,” yet no quantitative breakdown or table is referenced to support this observation.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment point by point below, indicating where revisions to the manuscript are planned.
Point-by-point responses
- Referee: [Abstract / Methods] The entire evaluation rests on 5-fold CV over only 106 notes (Abstract and Methods). No information is given on sampling procedure, fraction of the source EHR corpus represented, medical sub-domain coverage, or inter-annotator agreement; because all augmented data (up to 10k examples) is generated from this same seed, any selection bias is amplified rather than mitigated.
  Authors: We agree that the manuscript would benefit from additional dataset details. The 106 notes were randomly sampled from EHR notes at a single academic medical center and cover multiple clinical sub-domains, but the exact fraction of the source corpus and inter-annotator agreement statistics were not reported. In the revised version we will add a dedicated dataset subsection describing the sampling procedure, corpus fraction (where known), sub-domain coverage, and any available inter-annotator agreement figures. We will also expand the discussion to acknowledge that augmentation from the same seed can amplify selection bias and note this as a limitation of the current low-resource setting.
  Revision: partial
- Referee: [Abstract / Results] No baseline systems (rule-based term extractors, standard biomedical NER models, or simpler ranking methods) are reported, nor are statistical significance tests or confidence intervals provided for the F1/MRR differences (Abstract and Results). This makes it impossible to determine whether the claimed improvements over prompting are reliable.
  Authors: This observation is correct. The revised manuscript will include comparisons against at least two baselines: a rule-based medical term extractor using UMLS and a standard biomedical NER model (BioBERT). We will also add statistical significance testing (paired t-tests across the five folds) and 95% confidence intervals for all reported F1 and MRR differences to allow readers to assess the reliability of the observed gains.
  Revision: yes
- Referee: [Results] The claim that data augmentation “performed comparably or better” and that open-source models “outperformed closed-source models” when augmented is load-bearing for the paper’s contribution, yet the evaluation provides no external validation set or out-of-distribution test to separate modeling effects from idiosyncrasies of the 106-note distribution.
  Authors: We acknowledge that the lack of an external or out-of-distribution test set is a genuine limitation. The study was designed around a low-resource scenario with only 106 expert-annotated notes; an external validation set was not available. The 5-fold cross-validation therefore serves as the primary internal evaluation. In the revised discussion we will explicitly qualify the claims by stating that the reported improvements are observed within this distribution and recommend future validation on larger, multi-institutional corpora. No new external data will be added at this time.
  Revision: partial
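The significance testing proposed in the second response (paired t-tests across the five folds with 95% confidence intervals) could look roughly like the sketch below. It is an illustration only: the function name `paired_fold_comparison` and the example scores are hypothetical, and the constant 2.776 is the standard two-sided 95% critical value of Student's t with 4 degrees of freedom, hardcoded so the sketch needs only the standard library.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom
# (five folds give n - 1 = 4).
T_CRIT_95_DF4 = 2.776


def paired_fold_comparison(scores_a, scores_b):
    """Paired t-test and 95% CI on per-fold score differences (exactly 5 folds)."""
    if len(scores_a) != 5 or len(scores_b) != 5:
        raise ValueError("this sketch hardcodes the critical value for 5 folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    t_stat = mean_diff / std_err
    half_width = T_CRIT_95_DF4 * std_err
    return {
        "mean_diff": mean_diff,
        "t_stat": t_stat,
        "ci_low": mean_diff - half_width,
        "ci_high": mean_diff + half_width,
        "significant": abs(t_stat) > T_CRIT_95_DF4,
    }


if __name__ == "__main__":
    # Made-up per-fold F1 scores for two strategies, for illustration only.
    augmented = [0.40, 0.42, 0.44, 0.41, 0.43]
    prompted = [0.38, 0.39, 0.40, 0.40, 0.39]
    print(paired_fold_comparison(augmented, prompted))
```

Two caveats apply. Training sets in k-fold cross-validation overlap, so fold scores are not independent and a plain paired t-test tends to be optimistic. With only five folds the interval is also wide, so small F1 or MRR gaps may not be distinguishable from noise.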
Circularity Check
No circularity: purely empirical evaluation with standard CV
Full rationale
The paper reports results from an empirical evaluation of LLMs on a fixed set of 106 expert-annotated EHR notes. It compares prompting strategies, fine-tuning, and data augmentation (generated via ChatGPT from the seed notes) using 5-fold cross-validation to compute F1 and MRR. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential claims exist. All reported metrics are computed directly from model outputs on held-out folds; the augmentation process does not create a definitional loop because test performance is measured on unseen notes. No load-bearing self-citations or uniqueness theorems are invoked. The central claims rest on observable experimental outcomes rather than reducing to inputs by construction.
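The statement that test performance is measured on unseen notes holds only if augmented samples never cross fold boundaries. The paper's splitting code is not described here, so the following is a sketch of one way to enforce that constraint, under the assumption that each augmented sample records the seed note it was generated from. The names `grouped_k_folds` and `split_for_fold` and the `(seed_note_id, sample)` pair layout are hypothetical.

```python
import random


def grouped_k_folds(note_ids, k=5, seed=0):
    """Shuffle unique note IDs and deal them into k folds of near-equal size."""
    ids = sorted(set(note_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def split_for_fold(folds, fold_index, augmented):
    """Return (train_ids, test_ids, train_aug) for one fold.

    `augmented` is a list of (seed_note_id, sample) pairs. Only samples whose
    seed note is in the training folds are kept, so nothing derived from a
    held-out note reaches training.
    """
    test_ids = set(folds[fold_index])
    train_ids = {
        note_id
        for i, fold in enumerate(folds)
        if i != fold_index
        for note_id in fold
    }
    train_aug = [(sid, sample) for sid, sample in augmented if sid in train_ids]
    return train_ids, test_ids, train_aug


if __name__ == "__main__":
    folds = grouped_k_folds(range(106), k=5)
    # Placeholder augmented samples tagged with the note they came from.
    augmented = [(i % 106, f"aug-{i}") for i in range(1000)]
    train_ids, test_ids, train_aug = split_for_fold(folds, 0, augmented)
    print(len(train_ids), len(test_ids), len(train_aug))
```

Filtering by seed note addresses leakage within the 106 notes only. It does not address the referee's separate concern that augmentation from a single seed corpus can amplify selection bias.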
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Expert annotations on the 106 EHR notes constitute reliable ground truth for what counts as important medical jargon.