
arxiv: 2502.16022 · v2 · submitted 2025-02-22 · 💻 cs.CL

Enhancing LLMs for Identifying and Prioritizing Important Medical Jargons from Electronic Health Record Notes Utilizing Data Augmentation

Pith reviewed 2026-05-23 01:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords: medical jargon extraction · LLM fine-tuning · data augmentation · EHR notes · term prioritization · OpenNotes · F1 score · mean reciprocal rank

The pith

Fine-tuning and data augmentation let open-source LLMs outperform closed-source models at extracting and ranking medical jargon in EHR notes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors test closed-source and open-source large language models on the task of finding and ranking the most important medical terms in electronic health record notes. They compare basic prompting, few-shot examples, structured instructions, fine-tuning the models, and using ChatGPT to create extra training data. The key result is that fine-tuning and data augmentation give the biggest gains, with open-source models that receive this treatment beating closed-source models on the ranking metric. This matters for making medical notes more readable for patients who access them through OpenNotes.

Core claim

Experiments on 106 expert-annotated EHR notes show that fine-tuning and data augmentation improve LLM performance for extracting and prioritizing medical jargon, with GPT-4 Turbo reaching the highest F1 score of 0.433 and Mistral7B with augmentation achieving the highest MRR of 0.746; open-source models enhanced this way surpass closed-source models. Few-shot prompting outperforms zero-shot in vanilla models, structured prompts yield different preferences across models, and fine-tuning improves zero-shot performance but sometimes degrades few-shot performance. Data augmentation performs comparably to or better than other methods.

What carries the argument

Data augmentation generated by ChatGPT to expand training sets from 10 to 10,000 samples, paired with fine-tuning and ranking techniques evaluated via 5-fold cross-validation on F1 score and mean reciprocal rank.
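As a rough illustration of the two metrics named above, the sketch below computes a set-based F1 and a mean reciprocal rank over ranked term lists. It assumes exact string matching after lowercasing and whitespace normalization; the paper's actual matching rule is not stated on this page, and the function names and sample terms are hypothetical.

```python
def normalize(term):
    """Lowercase and collapse whitespace (an assumed matching rule)."""
    return " ".join(term.lower().split())


def f1_score(predicted, gold):
    """Set-based F1 between predicted terms and expert-annotated terms."""
    pred = {normalize(t) for t in predicted}
    ref = {normalize(t) for t in gold}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def reciprocal_rank(ranked, gold):
    """1 / rank of the first predicted term that appears in the gold set."""
    ref = {normalize(t) for t in gold}
    for rank, term in enumerate(ranked, start=1):
        if normalize(term) in ref:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(ranked_lists, gold_lists):
    """Average reciprocal rank over a collection of notes."""
    scores = [reciprocal_rank(r, g) for r, g in zip(ranked_lists, gold_lists)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical terms for one note, for illustration only.
gold_terms = ["warfarin", "atrial fibrillation", "stroke"]
model_terms = ["atrial fibrillation", "Warfarin", "headache"]
note_f1 = f1_score(model_terms, gold_terms)
note_rr = reciprocal_rank(model_terms, gold_terms)
```

F1 ignores ordering while the reciprocal rank depends only on the position of the first correct term, which may be one reason the best F1 and the best MRR can belong to different models.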

Load-bearing premise

The 106 expert-annotated EHR notes form a sufficient and representative sample for measuring how well the models identify and prioritize medical jargon.

What would settle it

Re-evaluating the same models and methods on a new collection of several hundred EHR notes drawn from different hospitals or regions and finding that the reported performance ordering of strategies reverses or the absolute scores fall below those of plain prompting.

Figures

Figures reproduced from arXiv: 2502.16022 by Hieu Tran, Hong Yu, Sharmin Sultana, Sunjae Kwon, Won Seok Jang, Zhichao Yang, Zonghai Yao.

Figure 1: The evaluation workflow for closed and open-Source LLMs. We evaluate the performance of the LLMs in …
Figure 2: A sample EHR note where physicians identified important medical terms. Diagnoses/conditions are high …
Figure 3: Case Study for Extracting the Top 3 Important Medical Jargons from Zero-shot and Few-shot Prompts in …
Figure 4: Case Study for extracting Top 5 important medical jargons from BioMistral7B and BioMistral7B that was …
Figure 5: Case Study for extracting Top 5 important medical jargons from Llama3.1 8B finetuned and Llama 3.1 …
Figure 6: Structured Prompt ### Instruction: You are a helpful assistant, an expert in medical domain. Extract top 3 key terms mentioned in the medical note that are important for the patient. If you think they are of same importance, they can have the same ranking. Do not write no symptoms, or any indication that there is no other diagnosis/symptoms or conditions. Do not modify or abbreviate what is written in the …
Figure 7: General Prompt
Figure 8: Prompt used for querying GPT-3.5 Turbo for data augmentation
Original abstract

OpenNotes enables patients to access EHR notes, but medical jargon can hinder comprehension. To improve understanding, we evaluated closed- and open-source LLMs for extracting and prioritizing key medical terms using prompting, fine-tuning, and data augmentation. We assessed LLMs on 106 expert-annotated EHR notes, experimenting with (i) general vs. structured prompts, (ii) zero-shot vs. few-shot prompting, (iii) fine-tuning, and (iv) data augmentation. To enhance open-source models in low-resource settings, we used ChatGPT for data augmentation and applied ranking techniques. We incrementally increased the augmented dataset size (10 to 10,000) and conducted 5-fold cross-validation, reporting F1 score and Mean Reciprocal Rank (MRR). Our result show that fine-tuning and data augmentation improved performance over other strategies. GPT-4 Turbo achieved the highest F1 (0.433), while Mistral7B with data augmentation had the highest MRR (0.746). Open-source models, when fine-tuned or augmented, outperformed closed-source models. Notably, the best F1 and MRR scores did not always align. Few-shot prompting outperformed zero-shot in vanilla models, and structured prompts yielded different preferences across models. Fine-tuning improved zero-shot performance but sometimes degraded few-shot performance. Data augmentation performed comparably or better than other methods. Our evaluation highlights the effectiveness of prompting, fine-tuning, and data augmentation in improving model performance for medical jargon extraction in low-resource scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates closed- and open-source LLMs for extracting and prioritizing medical jargon from EHR notes. Experiments on 106 expert-annotated notes compare general vs. structured prompts, zero- vs. few-shot prompting, fine-tuning, and ChatGPT-based data augmentation (scaled from 10 to 10k examples). 5-fold cross-validation is used to report F1 and MRR; the authors conclude that fine-tuning and augmentation improve results over prompting alone, GPT-4 Turbo reaches the highest F1 (0.433), Mistral-7B with augmentation reaches the highest MRR (0.746), and augmented open-source models can outperform closed-source ones.

Significance. If the performance gains hold on larger, more diverse EHR corpora, the work could provide practical guidance for low-resource medical-jargon extraction pipelines that combine prompting, fine-tuning, and synthetic data, potentially aiding patient comprehension of OpenNotes.

major comments (3)
  1. [Abstract / Methods] The entire evaluation rests on 5-fold CV over only 106 notes (Abstract and Methods). No information is given on sampling procedure, fraction of the source EHR corpus represented, medical sub-domain coverage, or inter-annotator agreement; because all augmented data (up to 10k examples) is generated from this same seed, any selection bias is amplified rather than mitigated.
  2. [Abstract / Results] No baseline systems (rule-based term extractors, standard biomedical NER models, or simpler ranking methods) are reported, nor are statistical significance tests or confidence intervals provided for the F1/MRR differences (Abstract and Results). This makes it impossible to determine whether the claimed improvements over prompting are reliable.
  3. [Results] The claim that data augmentation “performed comparably or better” and that open-source models “outperformed closed-source models” when augmented is load-bearing for the paper’s contribution, yet the evaluation provides no external validation set or out-of-distribution test to separate modeling effects from idiosyncrasies of the 106-note distribution.
minor comments (2)
  1. [Abstract] Typo: “Our result show” should read “Our results show.”
  2. [Abstract] The abstract states that “fine-tuning improved zero-shot performance but sometimes degraded few-shot performance,” yet no quantitative breakdown or table is referenced to support this observation.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment point by point below, indicating where revisions to the manuscript are planned.

  1. Referee: [Abstract / Methods] The entire evaluation rests on 5-fold CV over only 106 notes (Abstract and Methods). No information is given on sampling procedure, fraction of the source EHR corpus represented, medical sub-domain coverage, or inter-annotator agreement; because all augmented data (up to 10k examples) is generated from this same seed, any selection bias is amplified rather than mitigated.

    Authors: We agree that the manuscript would benefit from additional dataset details. The 106 notes were randomly sampled from EHR notes at a single academic medical center and cover multiple clinical sub-domains, but the exact fraction of the source corpus and inter-annotator agreement statistics were not reported. In the revised version we will add a dedicated dataset subsection describing the sampling procedure, corpus fraction (where known), sub-domain coverage, and any available inter-annotator agreement figures. We will also expand the discussion to acknowledge that augmentation from the same seed can amplify selection bias and note this as a limitation of the current low-resource setting.
    revision: partial

  2. Referee: [Abstract / Results] No baseline systems (rule-based term extractors, standard biomedical NER models, or simpler ranking methods) are reported, nor are statistical significance tests or confidence intervals provided for the F1/MRR differences (Abstract and Results). This makes it impossible to determine whether the claimed improvements over prompting are reliable.

    Authors: This observation is correct. The revised manuscript will include comparisons against at least two baselines: a rule-based medical term extractor using UMLS and a standard biomedical NER model (BioBERT). We will also add statistical significance testing (paired t-tests across the five folds) and 95% confidence intervals for all reported F1 and MRR differences to allow readers to assess the reliability of the observed gains.
    revision: yes

  3. Referee: [Results] The claim that data augmentation “performed comparably or better” and that open-source models “outperformed closed-source models” when augmented is load-bearing for the paper’s contribution, yet the evaluation provides no external validation set or out-of-distribution test to separate modeling effects from idiosyncrasies of the 106-note distribution.

    Authors: We acknowledge that the lack of an external or out-of-distribution test set is a genuine limitation. The study was designed around a low-resource scenario with only 106 expert-annotated notes; an external validation set was not available. The 5-fold cross-validation therefore serves as the primary internal evaluation. In the revised discussion we will explicitly qualify the claims by stating that the reported improvements are observed within this distribution and recommend future validation on larger, multi-institutional corpora. No new external data will be added at this time.
    revision: partial
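The paired t-test and 95% confidence interval mentioned in the second response could look roughly like the sketch below. It is only an illustration: the per-fold scores are made-up placeholders, not results from the paper, the function name is hypothetical, and the critical value 2.776 is the two-sided 95% Student's t value for 4 degrees of freedom, so it applies only to exactly five folds.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom
# (five folds minus one). Valid only when exactly five folds are compared.
T_CRIT_DF4_95 = 2.776


def paired_fold_comparison(scores_a, scores_b, t_crit=T_CRIT_DF4_95):
    """Paired t statistic and confidence interval for per-fold score differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems need one score per fold")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:
        # Degenerate case: every fold shows the same difference.
        t_stat = 0.0 if mean_diff == 0 else math.copysign(math.inf, mean_diff)
    else:
        t_stat = mean_diff / std_err
    margin = t_crit * std_err
    return {
        "mean_diff": mean_diff,
        "t_stat": t_stat,
        "ci_low": mean_diff - margin,
        "ci_high": mean_diff + margin,
        "significant": abs(t_stat) > t_crit,
    }


# Placeholder per-fold MRR values for two hypothetical systems.
augmented_mrr = [0.70, 0.72, 0.68, 0.74, 0.71]
prompt_only_mrr = [0.65, 0.70, 0.66, 0.69, 0.68]
result = paired_fold_comparison(augmented_mrr, prompt_only_mrr)
```

Because the training sets of different folds overlap, fold-level scores are not fully independent, so a test of this kind is best read as approximate.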

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with standard CV


The paper reports results from an empirical evaluation of LLMs on a fixed set of 106 expert-annotated EHR notes. It compares prompting strategies, fine-tuning, and data augmentation (generated via ChatGPT from the seed notes) using 5-fold cross-validation to compute F1 and MRR. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential claims exist. All reported metrics are computed directly from model outputs on held-out folds; the augmentation process does not create a definitional loop because test performance is measured on unseen notes. No load-bearing self-citations or uniqueness theorems are invoked. The central claims rest on observable experimental outcomes rather than reducing to inputs by construction.
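The no-loop argument above depends on synthetic training data never being derived from held-out notes. The sketch below shows one way a leak-free 5-fold setup could be arranged; it is an assumption about how such a pipeline might look, not the authors' code. The `augment` function is a hypothetical stand-in for the ChatGPT generation step and only records which seed note each synthetic example came from.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Shuffle note indices and deal them into k roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def augment(seed_note_ids, n_synthetic):
    """Hypothetical placeholder for LLM-based augmentation.

    A real pipeline would prompt a model with each seed note; here we only
    track the provenance of every synthetic example.
    """
    return [
        ("synthetic", seed_note_ids[i % len(seed_note_ids)], i)
        for i in range(n_synthetic)
    ]


def cross_validation_splits(n_notes=106, k=5, n_synthetic=10, seed=0):
    """Yield (train_ids, synthetic_examples, test_ids) for each fold."""
    folds = k_fold_indices(n_notes, k, seed)
    for held_out in range(k):
        test_ids = folds[held_out]
        train_ids = [
            note_id
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for note_id in fold
        ]
        # Augment from training notes only, so no synthetic example can
        # carry information from a held-out note.
        synthetic = augment(train_ids, n_synthetic)
        yield train_ids, synthetic, test_ids


for train_ids, synthetic, test_ids in cross_validation_splits():
    sources = {source_id for _, source_id, _ in synthetic}
    assert sources.isdisjoint(test_ids)
```

If augmentation were instead run once over all 106 notes before splitting, synthetic examples seeded by a test note could end up in that fold's training data, and the held-out scores would no longer reflect unseen notes.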

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an applied empirical study whose central claims rest on the quality of expert annotations and the assumption that automatic metrics reflect real patient utility.

axioms (1)
  • domain assumption: Expert annotations on the 106 EHR notes constitute reliable ground truth for what counts as important medical jargon.
    All reported F1 and MRR scores are computed directly against these annotations.

pith-pipeline@v0.9.0 · 5831 in / 1192 out tokens · 54032 ms · 2026-05-23T01:52:23.795765+00:00 · methodology

