
arxiv: 2605.17775 · v1 · pith:NAWLXK5Jnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI

Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

Pith reviewed 2026-05-20 11:40 UTC · model grok-4.3

keywords: synthetic clinical notes, LLM rephrasing, clinical text evaluation, factuality, ICD coding, MIMIC, downstream utility

The pith

LLM-rephrased clinical notes keep core information for broad tasks but lose details needed for ICD coding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a large-scale evaluation of clinical notes rephrased by large language models, using intrinsic measures of similarity, extrinsic tests on real clinical prediction tasks, and checks for factual accuracy. It establishes that the rephrased notes hold onto enough information to support coarse-grained clinical predictions and to help train models on uncommon conditions, even as the wording changes a lot. At the same time, they miss the specific details required for accurate ICD coding. The authors show that rephrasing smaller chunks of a note instead of the whole thing recovers much of that lost detail, though it leads to more factual mistakes when the model sees only partial context. This evaluation approach matters for anyone looking to use synthetic clinical text to expand datasets without compromising essential medical content.

Core claim

Through experiments at the million-note scale on data from the MIMIC databases, the work demonstrates that synthetic notes created by rephrasing with LLMs preserve core clinical information and the ability to make useful predictions on coarse-grained tasks, despite large shifts in language. For finer tasks such as ICD coding, however, important details are lost. Rephrasing the notes in chunks rather than as complete documents substantially reduces this loss of detail, but this comes at the expense of lower factual precision when the rephrasing model works with incomplete surrounding context. Synthesis errors turn out to be driven mainly by misinterpretation of clinical context, together with temporal confusion, measurement errors, and fabricated claims.
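
To make the contrast between the two rephrasing regimes concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the word-based splitting rule, the `max_words` parameter, and the `fake_rephrase` stand-in for an LLM call are all assumptions made for illustration, since the abstract does not say how chunks are formed or which model is used.

```python
from typing import Callable, List


def split_into_chunks(note: str, max_words: int) -> List[str]:
    """Split a note into consecutive chunks of at most `max_words` whitespace-separated words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = note.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def rephrase_whole(note: str, rephrase: Callable[[str], str]) -> str:
    """Whole-note regime: the model sees the full note in a single call."""
    return rephrase(note)


def rephrase_by_chunks(note: str, rephrase: Callable[[str], str], max_words: int) -> str:
    """Chunk regime: each chunk is rephrased without access to the rest of the note."""
    return "\n".join(rephrase(chunk) for chunk in split_into_chunks(note, max_words))


def fake_rephrase(text: str) -> str:
    # Hypothetical stand-in for an LLM call; it only upper-cases its input.
    return text.upper()


if __name__ == "__main__":
    note = "pt admitted with chest pain troponin elevated started on heparin drip"
    print(rephrase_by_chunks(note, fake_rephrase, max_words=4))
```

The sketch shows where the trade-off described above comes from: in the chunk regime each call covers less text, so fewer details compete for space in the output, but the model never sees what came before or after the chunk it is rewriting.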

What carries the argument

Multi-aspect evaluation at million-note scale that combines intrinsic similarity, extrinsic task utility, and factuality assessment on LLM-rephrased notes from clinical databases.

If this is right

  • Synthetic notes can augment training sets for rare ICD codes without being tailored to that task.
  • Chunk-based rephrasing recovers more fine-grained clinical details than full-note rephrasing.
  • Full context during rephrasing improves factual precision compared to incomplete context.
  • Most errors arise from misinterpretation of clinical context rather than from temporal confusion, measurement errors, or fabricated claims.
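
Factual precision and the error breakdown mentioned above can be pictured with a small Python sketch. It assumes each synthetic note has already been split into claims and that every claim has been judged against the source note, by a human or a model. The label strings and the `(claim, error_type)` data layout are hypothetical; only the four error categories come from the abstract.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Error categories named in the abstract; the label strings are assumptions.
ERROR_TYPES = (
    "context_misinterpretation",
    "temporal_confusion",
    "measurement_error",
    "fabricated_claim",
)

# A judged claim is (claim_text, error_type); error_type is None when the
# claim is supported by the source note.
JudgedClaim = Tuple[str, Optional[str]]


def factual_precision(claims: List[JudgedClaim]) -> float:
    """Fraction of claims in the synthetic note that the source note supports."""
    if not claims:
        raise ValueError("at least one claim is required")
    supported = sum(1 for _, err in claims if err is None)
    return supported / len(claims)


def error_breakdown(claims: List[JudgedClaim]) -> Dict[str, int]:
    """Count unsupported claims per error category."""
    for _, err in claims:
        if err is not None and err not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {err}")
    counts = Counter(err for _, err in claims if err is not None)
    return {name: counts.get(name, 0) for name in ERROR_TYPES}


if __name__ == "__main__":
    judged = [
        ("Patient has type 2 diabetes.", None),
        ("Metformin was continued.", None),
        ("Creatinine was 1.2 mg/dL.", None),
        ("Chest pain began after admission.", "temporal_confusion"),
        ("Family history was treated as the patient's own diagnosis.", "context_misinterpretation"),
    ]
    print(factual_precision(judged))
    print(error_breakdown(judged))
```

Comparing these two outputs between whole-note and chunk-rephrased notes is one way to see whether reduced context shifts the mix of error types, not just the overall precision.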

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These results suggest testing hybrid rephrasing strategies that mix chunk and full-note approaches for different note sections.
  • Similar evaluations could apply to other medical text types such as discharge summaries or radiology reports.
  • The approach opens possibilities for generating privacy-preserving synthetic datasets for broader medical AI development.

Load-bearing premise

The chosen intrinsic, extrinsic, and factuality metrics together give a complete enough picture of synthetic note quality for all downstream clinical applications.

What would settle it

A direct comparison showing that chunk-rephrased notes do not improve ICD coding performance over whole-note rephrased notes on a large test set would falsify the mitigation benefit of chunking.
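
One way such a comparison could be run is a paired bootstrap over test notes, sketched below in Python. This is an illustration, not the paper's protocol: the choice of micro-F1 as the ICD coding metric, the representation of labels as sets of code strings, and the function names are assumptions. The two prediction lists would come from coders trained on chunk-rephrased and whole-note-rephrased notes and scored on the same test set.

```python
import random
from typing import List, Sequence, Set, Tuple


def micro_f1(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> float:
    """Micro-averaged F1 over multi-label code sets, one set per note."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_diff(
    gold: Sequence[Set[str]],
    pred_chunk: Sequence[Set[str]],
    pred_whole: Sequence[Set[str]],
    n_resamples: int = 1000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed, lower, upper) for micro-F1(chunk) - micro-F1(whole).

    The bounds form a 95% percentile interval obtained by resampling notes
    with replacement, using the same resampled notes for both systems.
    """
    n = len(gold)
    if n == 0 or len(pred_chunk) != n or len(pred_whole) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    observed = micro_f1(gold, pred_chunk) - micro_f1(gold, pred_whole)
    diffs: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        c = [pred_chunk[i] for i in idx]
        w = [pred_whole[i] for i in idx]
        diffs.append(micro_f1(g, c) - micro_f1(g, w))
    diffs.sort()
    lower = diffs[int(0.025 * (n_resamples - 1))]
    upper = diffs[int(0.975 * (n_resamples - 1))]
    return observed, lower, upper
```

Under these assumptions, an interval that sits clearly above zero would support the chunking benefit, while an interval that includes zero or lies below it on a large test set would count against it.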

Figures

Figures reproduced from arXiv: 2605.17775 by Anthony Nguyen, Jinghui Liu, Sarvesh Soni.

Figure 1. Linguistic features of synthetic notes and their […]
Figure 2. Comparing synthetic notes with human-written notes in downstream modeling. Each metric is calculated […]
Figure 3. Fairness evaluation of the synthetic notes in downstream modeling in comparison with human-written […]
Figure 4. Impact of varied prompt instructions on intrin[…]
Figure 5. Impact of repeated synthesis on 30-day hospi[…]
Original abstract

Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect -- such as similarity or utility comparisons -- even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for tasks like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes -- despite their task-agnostic nature -- can effectively augment task-specific training for rare ICD codes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a large-scale (million-note) evaluation of LLM-rephrased synthetic clinical notes derived from MIMIC data, combining intrinsic similarity metrics, extrinsic predictive utility tests (including ICD coding), and factuality/error analysis. It claims that synthetic notes retain core clinical information and utility for coarse-grained tasks despite linguistic divergence, but lose fine-grained details; chunk-wise rephrasing substantially mitigates the detail loss for tasks such as ICD coding, albeit at the expense of reduced factual precision under incomplete context. Synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. The paper also reports that the notes can augment training data for rare ICD codes.

Significance. If the central claims hold, the work supplies a practical, multi-faceted evaluation framework for synthetic clinical text at unprecedented scale and demonstrates a concrete mitigation strategy (chunk rephrasing) that could improve utility for data augmentation in clinical NLP, particularly for rare-event modeling. The million-note scale and complementary intrinsic/extrinsic/factuality lens are genuine strengths that go beyond typical narrow similarity or utility studies.

major comments (2)
  1. [Abstract / Extrinsic evaluation] Abstract and extrinsic-evaluation section: the headline claim that chunk rephrasing 'substantially mitigates' loss of fine-grained detail rests on ICD-coding accuracy as the sole reported extrinsic proxy for detail preservation. Given that the paper's own error analysis identifies temporal confusion and context misinterpretation as dominant failure modes—precisely the phenomena that incomplete chunk context could exacerbate—the mitigation benefit may be task-specific rather than general. Additional fine-grained tasks (medication reconciliation, procedure sequencing, temporal ordering) are needed to support the broader claim.
  2. [Methods] Methods / chunk-rephrasing description: chunk size is listed as a free parameter with no sensitivity analysis or pre-specified justification provided. Because the mitigation result for ICD coding is shown only for the chosen chunking regime, it is unclear whether the reported improvement is robust or post-hoc tuned; a load-bearing sensitivity table or ablation across chunk sizes would be required to substantiate the strategy.
minor comments (2)
  1. [Abstract] The abstract states that the three evaluation families are 'complementary and best viewed in parallel,' yet no quantitative integration or joint statistical test across intrinsic, extrinsic, and factuality scores is reported; readers cannot easily judge overall quality trade-offs.
  2. [Results] Error bars, confidence intervals, or statistical significance tests are absent from the reported metric comparisons; given the million-note scale, even modest effect sizes should be accompanied by uncertainty estimates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We appreciate the recognition of the scale and multi-faceted nature of our evaluation. Below, we provide point-by-point responses to the major comments and indicate the revisions we plan to make.

  1. Referee: [Abstract / Extrinsic evaluation] Restating major comment 1: the claim that chunk rephrasing 'substantially mitigates' loss of fine-grained detail rests on ICD-coding accuracy as the sole extrinsic proxy, so the benefit may be task-specific, and additional fine-grained tasks (medication reconciliation, procedure sequencing, temporal ordering) are needed to support the broader claim.

    Authors: We acknowledge that our extrinsic evaluation for fine-grained detail preservation relies primarily on ICD coding accuracy as a proxy. ICD coding is a clinically relevant task that requires precise extraction of specific diagnostic information from the notes, making it a strong indicator of detail retention. However, we agree that demonstrating the mitigation effect on additional tasks would strengthen the claim. In the revised manuscript, we will expand the discussion to address why ICD coding serves as a representative task for fine-grained clinical details and include a limitations section noting that the benefits may vary across tasks. If space and resources permit, we will consider adding preliminary results for one additional task such as medication reconciliation.

    Revision: partial

  2. Referee: [Methods] Restating major comment 2: chunk size is a free parameter with no sensitivity analysis or pre-specified justification, so it is unclear whether the ICD coding improvement is robust or post-hoc tuned; a sensitivity table or ablation across chunk sizes is required.

    Authors: We agree that a sensitivity analysis on chunk size would improve the robustness of our findings. The chunk size was chosen based on preliminary experiments to balance context completeness with computational efficiency, but we did not include the full ablation in the original submission. In the revision, we will add a sensitivity analysis table showing ICD coding performance across a range of chunk sizes (e.g., 500, 1000, 2000 tokens) to demonstrate that the improvement is consistent and not tuned to a specific value.

    Revision: yes

Circularity Check

0 steps flagged

No circularity in empirical evaluation of synthetic clinical notes

The paper performs direct empirical comparisons of LLM-rephrased notes against original MIMIC data using intrinsic similarity metrics, extrinsic task performance (e.g., ICD coding), and factuality checks at scale. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the reported results. Central claims rest on observable differences in linguistic changes, detail preservation, and error modes rather than any quantity defined in terms of itself or reduced by construction to the input data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the representativeness of MIMIC notes as a source for rephrasing and the adequacy of the three evaluation categories to capture quality. No new physical entities or theoretical constructs are introduced. Experimental choices such as chunk size act as free parameters, but their values are not stated in the abstract.

free parameters (1)
  • chunk size for rephrasing
    Experimental design choice to balance detail preservation against context completeness; value not numerically specified in abstract.
axioms (1)
  • domain assumption: MIMIC clinical notes are representative of real-world clinical documentation for evaluation purposes
    Used as the source database for generating and evaluating synthetic notes at scale.




Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 6 internal anchors

  1. [1]

    Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation

    Carandang, Kristine Ann M and Arana, Jasper Meynard and Casin, Ethan Robert and Monterola, Christopher and Tan, Daniel Stanley and Valenzuela, Jesus Felix B and Alis, Christian. Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation. Proceedings of the 63rd Annual Meeting of the Association for Computatio...

  2. [2]

    FactEHR : A Dataset for Evaluating Factuality in Clinical Notes Using LLMs

    Munnangi, Monica and Swaminathan, Akshay and Fries, Jason Alan and Jindal, Jenelle A and Narayanan, Sanjana and Lopez, Ivan and Tu, Lucia and Chung, Philip and Omiye, Jesutofunmi and Kashyap, Mehr and Shah, Nigam. FactEHR : A Dataset for Evaluating Factuality in Clinical Notes Using LLMs. Machine Learning for Healthcare Conference

  3. [3]

    Causal representation learning from multimodal clinical records under non-random modality missingness

    Liang, Zihan and Pan, Ziwen and Xiong, Ruoxuan. Causal representation learning from multimodal clinical records under non-random modality missingness. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

  4. [4]

    Factuality of large language models: A survey

    Wang, Yuxia and Wang, Minghan and Manzoor, Muhammad Arslan and Liu, Fei and Georgiev, Georgi Nenkov and Das, Rocktim Jyoti and Nakov, Preslav. Factuality of large language models: A survey. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

  5. [5]

    Generation of synthetic clinical text: A systematic review

    Alshaikhdeeb, Basel and Hemedan, Ahmed Abdelmonem and Ghosh, Soumyabrata and Balaur, Irina and Satagopam, Venkata. Generation of synthetic clinical text: A systematic review. arXiv [cs.CL]. arXiv:2507.18451

  6. [6]

    A review on generative AI models for synthetic medical text, time series, and longitudinal data

    Loni, Mohammad and Poursalim, Fatemeh and Asadi, Mehdi and Gharehbaghi, Arash. A review on generative AI models for synthetic medical text, time series, and longitudinal data. npj digital medicine

  7. [7]

    FActScore : Fine-grained atomic evaluation of factual precision in long form text generation

    Min, Sewon and Krishna, Kalpesh and Lyu, Xinxi and Lewis, Mike and Yih, Wen-Tau and Koh, Pang and Iyyer, Mohit and Zettlemoyer, Luke and Hajishirzi, Hannaneh. FActScore : Fine-grained atomic evaluation of factual precision in long form text generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

  8. [8]

    Synth- SBDH : A synthetic dataset of social and behavioral determinants of health for clinical text

    Mitra, Avijit and Yang, Zhichao and Druhl, Emily and Goodwin, Raelene and Yu, Hong. Synth- SBDH : A synthetic dataset of social and behavioral determinants of health for clinical text. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

  9. [9]

    An evaluation of synthetic data augmentation for mitigating covariate bias in health data

    Juwara, Lamin and El-Hussuna, Alaa and El Emam, Khaled. An evaluation of synthetic data augmentation for mitigating covariate bias in health data. Patterns (New York, N.Y.)

  10. [10]

    Demystifying synthetic data in LLM pre-training: A systematic study of scaling laws, benefits, and pitfalls

    Kang, Feiyang and Ardalani, Newsha and Kuchnik, Michael and Emad, Youssef and Elhoushi, Mostafa and Sengupta, Shubhabrata and Li, Shang-Wen and Raghavendra, Ramya and Jia, Ruoxi and Wu, Carole-Jean. Demystifying synthetic data in LLM pre-training: A systematic study of scaling laws, benefits, and pitfalls. Proceedings of the 2025 Conference on Empirical M...

  11. [11]

    Generative models improve fairness of medical classifiers under distribution shifts

    Ktena, Ira and Wiles, Olivia and Albuquerque, Isabela and Rebuffi, Sylvestre-Alvise and Tanno, Ryutaro and Roy, Abhijit Guha and Azizi, Shekoofeh and Belgrave, Danielle and Kohli, Pushmeet and Cemgil, Taylan and Karthikesalingam, Alan and Gowal, Sven. Generative models improve fairness of medical classifiers under distribution shifts. Nature Medicine

  12. [12]

    NoteChat : A dataset of synthetic patient-physician conversations conditioned on clinical notes

    Wang, Junda and Yao, Zonghai and Yang, Zhichao and Zhou, Huixue and Li, Rumeng and Wang, Xun and Xu, Yucheng and Yu, Hong. NoteChat : A dataset of synthetic patient-physician conversations conditioned on clinical notes. Findings of the Association for Computational Linguistics ACL 2024

  13. [13]

    BeyondWeb : Lessons from scaling synthetic data for trillion-scale pretraining

    Maini, Pratyush and Dorna, Vineeth and Doshi, Parth and Carranza, Aldo and Pan, Fan and Urbanek, Jack and Burstein, Paul and Fang, Alex and Deng, Alvin and Abbas, Amro and Larsen, Brett and Blakeney, Cody and Bannur, Charvi and Baek, Christina and Teh, Darren and Schwab, David and Mongstad, Haakon and Yin, Haoli and Wills, Josh and Mentzer, Kaleigh and Me...

  14. [14]

    Verifying facts in patient care documents generated by large language models using electronic health records

    Chung, Philip and Swaminathan, Akshay and Goodell, Alex J and Kim, Yeasul and Momsen Reincke, S and Han, Lichy and Deverett, Ben and Sadeghi, Mohammad Amin and Ariss, Abdel-Badih and Ghanem, Marc and Seong, David and Lee, Andrew A and Coombes, Caitlin E and Bradshaw, Brad and Sufian, Mahir A and Hong, Hyo Jung and Nguyen, Teresa P and Rasouli, Mohammad R ...

  15. [15]

    Evaluation of electronic health record-integrated artificial intelligence chart review

    Kahl, Nicolas M and Frieden, Marshall J and Pope, Zach R and Millen, Marlene M and Tolia, Vaishal M and Chan, Theodore C and Longhurst, Christopher A and Singh, Karandeep and You, Alan X. Evaluation of electronic health record-integrated artificial intelligence chart review. Npj Health Systems

  16. [16]

    Assisting in writing Wikipedia-like articles from scratch with large language models

    Shao, Yijia and Jiang, Yucheng and Kanell, Theodore and Xu, Peter and Khattab, Omar and Lam, Monica. Assisting in writing Wikipedia-like articles from scratch with large language models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

  17. [17]

    Toward relieving clinician burden by automatically generating progress notes using interim hospital data

    Soni, Sarvesh and Demner-Fushman, Dina. Toward relieving clinician burden by automatically generating progress notes using interim hospital data. AMIA Annual Symposium Proceedings

  18. [18]

    Health system-scale language models are all-purpose prediction engines

    Jiang, Lavender Yao and Liu, Xujin Chris and Nejatian, Nima Pour and Nasir-Moin, Mustafa and Wang, Duo and Abidin, Anas and Eaton, Kevin and Riina, Howard Antony and Laufer, Ilya and Punjabi, Paawan and Miceli, Madeline and Kim, Nora C and Orillac, Cordelia and Schnurman, Zane and Livia, Christopher and Weiss, Hannah and Kurland, David and Neifert, Sean a...

  19. [19]

    Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art

    Lewis, Patrick and Ott, Myle and Du, Jingfei and Stoyanov, Veselin. Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art. Proceedings of the 3rd Clinical Natural Language Processing Workshop

  20. [20]

    Automated Medical Coding on MIMIC - III and MIMIC - IV : A Critical Review and Replicability Study

    Edin, Joakim and Junge, Alexander and Havtorn, Jakob D and Borgholt, Lasse and Maistro, Maria and Ruotsalo, Tuukka and Maaløe, Lars. Automated Medical Coding on MIMIC - III and MIMIC - IV : A Critical Review and Replicability Study. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

  21. [21]

    Adapted large language models can outperform medical experts in clinical text summarization

    Van Veen, Dave and Van Uden, Cara and Blankemeier, Louis and Delbrouck, Jean-Benoit and Aali, Asad and Bluethgen, Christian and Pareek, Anuj and Polacin, Malgorzata and Reis, Eduardo Pontes and Seehofnerová, Anna and Rohatgi, Nidhi and Hosamani, Poonam and Collins, William and Ahuja, Neera and Langlotz, Curtis P and Hom, Jason and Gatidis, Sergios and Pau...

  22. [22]

    Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation

    Yim, Wen-Wai and Fu, Yujuan and Ben Abacha, Asma and Snider, Neal and Lin, Thomas and Yetisgen, Meliha. Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Scientific data

  23. [23]

    The Llama 3 Herd of Models

    Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Yang, Amy and Fan, Angela and Goyal, Anirudh and Hartshorn, Anthony and Yang, Aobo and Mitra, Archi and Sravankumar, Archie and Korenev, Artem and Hinsvark, Arthur and Rao, Arun and Zhang, Aston and ...

  24. [24]

    Large language models are less effective at clinical prediction tasks than locally trained machine learning models

    Brown, Katherine E and Yan, Chao and Li, Zhuohang and Zhang, Xinmeng and Collins, Benjamin X and Chen, You and Clayton, Ellen Wright and Kantarcioglu, Murat and Vorobeychik, Yevgeniy and Malin, Bradley A. Large language models are less effective at clinical prediction tasks than locally trained machine learning models. Journal of the American Medical Info...

  25. [25]

    Using synthetic health care data to leverage large language models for named entity recognition: Development and validation study

    Šuvalov, Hendrik and Lepson, Mihkel and Kukk, Veronika and Malk, Maria and Ilves, Neeme and Kuulmets, Hele-Andra and Kolde, Raivo. Using synthetic health care data to leverage large language models for named entity recognition: Development and validation study. Journal of medical internet research

  26. [26]

    Efficient memory management for large language model serving with PagedAttention

    Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph and Zhang, Hao and Stoica, Ion. Efficient memory management for large language model serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles

  27. [27]

    Evaluating hospital course summarization by an electronic health record-based large language model

    Small, William R and Austrian, Jonathan and O'Donnell, Luke and Burk-Rafel, Jesse and Hochman, Katherine A and Goodman, Adam and Zaretsky, Jonah and Martin, Jacob and Johnson, Stephen and Major, Vincent J and Jones, Simon and Henke, Christian and Verplanke, Benjamin and Osso, Jwan and Larson, Ian and Saxena, Archana and Mednick, Aron and Simonis, Choumika...

  28. [28]

    DeBERTaV3 : Improving DeBERTa using ELECTRA -Style Pre-Training with Gradient-Disentangled Embedding Sharing

    He, Pengcheng and Gao, Jianfeng and Chen, Weizhu. DeBERTaV3 : Improving DeBERTa using ELECTRA -Style Pre-Training with Gradient-Disentangled Embedding Sharing. The Eleventh International Conference on Learning Representations

  29. [29]

    Generating synthetic clinical text with local large language models to identify misdiagnosed limb fractures in radiology reports

    Liu, Jinghui and Koopman, Bevan and Brown, Nathan J and Chu, Kevin and Nguyen, Anthony. Generating synthetic clinical text with local large language models to identify misdiagnosed limb fractures in radiology reports. Artificial intelligence in medicine

  30. [30]

    SMOG grading - A new readability formula

    Harry, G and Laughlin, Mc. SMOG grading - A new readability formula. The Journal of Reading

  31. [31]

    Large language models can support generation of standardized discharge summaries - A retrospective study utilizing ChatGPT -4 and electronic health records

    Schwieger, Arne and Angst, Katrin and de Bardeci, Mateo and Burrer, Achim and Cathomas, Flurin and Ferrea, Stefano and Grätz, Franziska and Knorr, Marius and Kronenberg, Golo and Spiller, Tobias and Troi, David and Seifritz, Erich and Weber, Samantha and Olbrich, Sebastian. Large language models can support generation of standardized discharge summaries -...

  32. [32]

    Equality of opportunity in supervised learning

    Hardt, Moritz and Price, Eric and Srebro, Nathan. Equality of opportunity in supervised learning. Proceedings of the 30th International Conference on Neural Information Processing Systems

  33. [33]

    DataComp - LM : In search of the next generation of training sets for language models

    Li, Jeffrey and Fang, Alex and Smyrnis, Georgios and Ivgi, Maor and Jordan, Matt and Gadre, Samir Yitzhak and Bansal, Hritik and Guha, Etash Kumar and Keh, Sedrick and Arora, Kushal and Garg, Saurabh and Xin, Rui and Muennighoff, Niklas and Heckel, Reinhard and Mercat, Jean and Chen, Mayee F and Gururangan, Suchin and Wortsman, Mitchell and Albalak, Alon ...

  34. [34]

    Rephrasing Electronic Health Records for Pretraining Clinical Language Models

    Liu, Jinghui and Nguyen, Anthony. Rephrasing Electronic Health Records for Pretraining Clinical Language Models. Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association

  35. [35]

    Evaluating text complexity and Flesch-Kincaid grade level

    Solnyshkina, M and Zamaletdinov, R and Gorodetskaya, L and Gabitov, A. Evaluating text complexity and Flesch-Kincaid grade level. Journal of social studies education research

  36. [36]

    MIMIC- III , a freely accessible critical care database

    Johnson, Alistair E W and Pollard, Tom J and Shen, Lu and Lehman, Li-Wei H and Feng, Mengling and Ghassemi, Mohammad and Moody, Benjamin and Szolovits, Peter and Celi, Leo Anthony and Mark, Roger G. MIMIC- III , a freely accessible critical care database. Scientific data

  37. [37]

    Textbooks Are All You Need

    Gunasekar, Suriya and Zhang, Yi and Aneja, Jyoti and Mendes, Caio César Teodoro and Del Giorno, Allie and Gopi, Sivakanth and Javaheripi, Mojan and Kauffmann, Piero and de Rosa, Gustavo and Saarikivi, Olli and Salim, Adil and Shah, Shital and Behl, Harkirat Singh and Wang, Xin and Bubeck, Sébastien and Eldan, Ronen and Kalai, Adam Tauman and Lee, Yin Tat ...

  38. [38]

    A Survey of Evaluation Metrics Used for NLG Systems

    Sai, Ananya B and Mohankumar, Akash Kumar and Khapra, Mitesh M. A Survey of Evaluation Metrics Used for NLG Systems. ACM Comput. Surv

  39. [39]

    Two Directions for Clinical Data Generation with Large Language Models: Data-to-Label and Label-to-Data

    Li, Rumeng and Wang, Xun and Yu, Hong. Two Directions for Clinical Data Generation with Large Language Models: Data-to-Label and Label-to-Data. Findings of the Association for Computational Linguistics: EMNLP 2023

  40. [40]

    ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission

    Huang, Kexin and Altosaar, Jaan and Ranganath, Rajesh. ClinicalBERT : Modeling Clinical Notes and Predicting Hospital Readmission. arXiv [cs.CL]. arXiv:1904.05342

  41. [41]

    Hierarchical label-wise attention transformer model for explainable ICD coding

    Liu, Leibo and Perez-Concha, Oscar and Nguyen, Anthony and Bennett, Vicki and Jorm, Louisa. Hierarchical label-wise attention transformer model for explainable ICD coding. Journal of biomedical informatics

  42. [42]

    Deep Patient Representation of Clinical Notes via Multi-Task Learning for Mortality Prediction

    Si, Yuqi and Roberts, Kirk. Deep Patient Representation of Clinical Notes via Multi-Task Learning for Mortality Prediction. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science

  43. [43]

    MIMIC- IV , a freely accessible electronic health record dataset

    Johnson, Alistair E W and Bulgarelli, Lucas and Shen, Lu and Gayles, Alvin and Shammout, Ayad and Horng, Steven and Pollard, Tom J and Moody, Benjamin and Gow, Brian and Lehman, Li-Wei H and Celi, Leo A and Mark, Roger G. MIMIC- IV , a freely accessible electronic health record dataset. Scientific data

  44. [44]

    Unintended Consequences of Nationwide Electronic Health Record Adoption: Challenges and Opportunities in the Post-Meaningful Use Era

    Colicchio, Tiago K and Cimino, James J and Del Fiol, Guilherme. Unintended Consequences of Nationwide Electronic Health Record Adoption: Challenges and Opportunities in the Post-Meaningful Use Era. Journal of medical Internet research

  45. [45]

    Evaluating progress in automatic chest X -ray radiology report generation

    Yu, Feiyang and Endo, Mark and Krishnan, Rayan and Pan, Ian and Tsai, Andy and Reis, Eduardo Pontes and Fonseca, Eduardo Kaiser Ururahy and Lee, Henrique Min Ho and Abad, Zahra Shakeri Hossein and Ng, Andrew Y and Langlotz, Curtis P and Venugopal, Vasantha Kumar and Rajpurkar, Pranav. Evaluating progress in automatic chest X -ray radiology report generati...

  46. [46]

    “Note Bloat” impacts deep learning-based NLP models for clinical prediction tasks

    Liu, Jinghui and Capurro, Daniel and Nguyen, Anthony and Verspoor, Karin. “Note Bloat” impacts deep learning-based NLP models for clinical prediction tasks. Journal of biomedical informatics

  47. [47]

    DRG-LLaMA: tuning LLaMA model to predict diagnosis-related group for hospitalized patients

    Wang, Hanyin and Gao, Chufan and Dantona, Christopher and Hull, Bryan and Sun, Jimeng. DRG-LLaMA: tuning LLaMA model to predict diagnosis-related group for hospitalized patients. NPJ digital medicine. arXiv:2309.12625

  48. [48]

    PLM-ICD: Automatic ICD Coding with Pretrained Language Models

    Huang, Chao-Wei and Tsai, Shang-Chi and Chen, Yun-Nung. PLM-ICD: Automatic ICD Coding with Pretrained Language Models. Proceedings of the 4th Clinical Natural Language Processing Workshop

  49. [49]

    ROUGE: A Package for Automatic Evaluation of Summaries

    Lin, Chin-Yew. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out

  50. [50]

    Mistral 7B

    Jiang, Albert Q and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and de las Casas, Diego and Bressand, Florian and Lengyel, Gianna and Lample, Guillaume and Saulnier, Lucile and Lavaud, Lélio Renard and Lachaux, Marie-Anne and Stock, Pierre and Le Scao, Teven and Lavril, Thibaut and Wang, Thomas and Lacroix, Ti...

  51. [51]

    Does Synthetic Data Generation of LLMs Help Clinical Text Mining?

    Tang, Ruixiang and Han, Xiaotian and Jiang, Xiaoqian and Hu, Xia. Does Synthetic Data Generation of LLMs Help Clinical Text Mining?. arXiv [cs.CL]. arXiv:2303.04360

  52. [52]

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    Penedo, Guilherme and Kydlíček, Hynek and Allal, Loubna Ben and Lozhkov, Anton and Mitchell, Margaret and Raffel, Colin and Von Werra, Leandro and Wolf, Thomas. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv [cs.CL]. arXiv:2406.17557

  53. [53]

    BERTScore: Evaluating Text Generation with BERT

    Zhang, Tianyi and Kishore, Varsha and Wu, Felix and Weinberger, Kilian Q and Artzi, Yoav. BERTScore: Evaluating Text Generation with BERT. International Conference on Learning Representations

  54. [54]

    Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes

    Kweon, Sunjun and Kim, Junu and Kim, Jiyoun and Im, Sujeong and Cho, Eunbyeol and Bae, Seongsu and Oh, Jungwoo and Lee, Gyubok and Moon, Jong Hak and You, Seng Chan and Baek, Seungjin and Han, Chang Hoon and Jung, Yoon Bin and Jo, Yohan and Choi, Edward. Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes. Findings of the As...

  55. [55]

    Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

    Maini, Pratyush and Seto, Skyler and Bai, Richard and Grangier, David and Zhang, Yizhe and Jaitly, Navdeep. Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

  56. [56]

    Discharge Me!

    Xu, Justin and Chen, Zhihong and Johnston, Andrew and Blankemeier, Louis and Varma, Maya and Hom, Jason and Collins, William J and Modi, Ankit and Lloyd, Robert and Hopkins, Benjamin and Langlotz, Curtis and Delbrouck, Jean-Benoit. Overview of the First Shared Task on Clinical Text Generation: RRG24 and “Discharge Me!”. Proceedings of the 23rd Workshop on...

  57. [57]

    Hard for humans, hard for machines: predicting readmission after psychiatric hospitalization using narrative notes

    Boag, William and Kovaleva, Olga and McCoy, Jr, Thomas H and Rumshisky, Anna and Szolovits, Peter and Perlis, Roy H. Hard for humans, hard for machines: predicting readmission after psychiatric hospitalization using narrative notes. Translational psychiatry

  58. [58]

    e-Health CSIRO at RRG24: Entropy-Augmented Self-Critical Sequence Training for Radiology Report Generation

    Nicolson, Aaron and Liu, Jinghui and Dowling, Jason and Nguyen, Anthony and Koopman, Bevan. e-Health CSIRO at RRG24: Entropy-Augmented Self-Critical Sequence Training for Radiology Report Generation. Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

  59. [59]

    Qwen2.5 Technical Report

    Qwen and Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Ke...

  60. [60]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Hsieh, Cheng-Ping and Sun, Simeng and Kriman, Samuel and Acharya, Shantanu and Rekesh, Dima and Jia, Fei and Ginsburg, Boris. RULER: What's the Real Context Size of Your Long-Context Language Models?. First Conference on Language Modeling

  61. [61]

    Lost in the middle: How language models use long contexts

    Liu, Nelson F and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics

  62. [62]

    A unified review of deep learning for automated medical coding

    Ji, Shaoxiong and Li, Xiaobo and Sun, Wei and Dong, Hang and Taalas, Ara and Zhang, Yijia and Wu, Honghan and Pitkänen, Esa and Marttinen, Pekka. A unified review of deep learning for automated medical coding. ACM computing surveys

  63. [63]

    Current and future state of evaluation of large language models for medical summarization tasks

    Croxford, Emma and Gao, Yanjun and Pellegrino, Nicholas and Wong, Karen and Wills, Graham and First, Elliot and Liao, Frank and Goswami, Cherodeep and Patterson, Brian and Afshar, Majid. Current and future state of evaluation of large language models for medical summarization tasks. npj Health Systems

  64. [64]

    Summarizing Patients’ Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models

    Gao, Yanjun and Dligach, Dmitriy and Miller, Timothy and Xu, Dongfang and Churpek, Matthew M M and Afshar, Majid. Summarizing Patients’ Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models. Proceedings of the 29th International Conference on Computational Linguistics

  65. [65]

    Paraphrasing to improve the performance of Electronic Health Records Question Answering

    Soni, Sarvesh and Roberts, Kirk. Paraphrasing to improve the performance of Electronic Health Records Question Answering. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science

  66. [66]

    AI models collapse when trained on recursively generated data

    Shumailov, Ilia and Shumaylov, Zakhar and Zhao, Yiren and Papernot, Nicolas and Anderson, Ross and Gal, Yarin. AI models collapse when trained on recursively generated data. Nature

  67. [67]

    QuickUMLS: a fast, unsupervised approach for medical concept extraction

    Soldaini, Luca. QuickUMLS: a fast, unsupervised approach for medical concept extraction. MedIR Workshop, SIGIR

  68. [68]

    Large language models for reducing clinicians' documentation burden

    Roberts, Kirk. Large language models for reducing clinicians' documentation burden. Nature medicine

  69. [69]

    A survey of bias in machine learning through the prism of statistical parity

    Besse, Philippe and del Barrio, Eustasio and Gordaliza, Paula and Loubes, Jean-Michel and Risser, Laurent. A survey of bias in machine learning through the prism of statistical parity. The American statistician

  70. [70]

    Aligning AI research with the needs of clinical coding workflows: Eight recommendations based on US data analysis and critical review

    Gan, Yidong and Rybinski, Maciej and Hachey, Ben and Kummerfeld, Jonathan K. Aligning AI research with the needs of clinical coding workflows: Eight recommendations based on US data analysis and critical review. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

  71. [71]

    DocLens: Multi-aspect fine-grained medical text evaluation

    Xie, Yiqing and Zhang, Sheng and Cheng, Hao and Liu, Pengfei and Gero, Zelalem and Wong, Cliff and Naumann, Tristan and Poon, Hoifung and Rose, Carolyn. DocLens: Multi-aspect fine-grained medical text evaluation. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

  72. [72]

    Less is more: Explainable and efficient ICD code prediction with clinical entities

    Douglas, James C and Gan, Yidong and Hachey, Ben and Kummerfeld, Jonathan K. Less is more: Explainable and efficient ICD code prediction with clinical entities. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)