Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

Anthony Nguyen; Jinghui Liu; Sarvesh Soni

arXiv: 2605.17775 · v1 · pith:NAWLXK5Jnew · submitted 2026-05-18 · cs.CL · cs.AI

Pith reviewed 2026-05-20 11:40 UTC · model grok-4.3

keywords: synthetic clinical notes · LLM rephrasing · clinical text evaluation · factuality · ICD coding · MIMIC · downstream utility

The pith

LLM-rephrased clinical notes keep core information for broad tasks but lose details needed for ICD coding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a large-scale evaluation of clinical notes rephrased by large language models, using intrinsic measures of similarity, extrinsic tests on real clinical prediction tasks, and checks for factual accuracy. It establishes that the rephrased notes hold onto enough information to support coarse-grained clinical predictions and to help train models on uncommon conditions, even as the wording changes a lot. At the same time, they miss the specific details required for accurate ICD coding. The authors show that rephrasing smaller chunks of a note instead of the whole thing recovers much of that lost detail, though it leads to more factual mistakes when the model sees only partial context. This evaluation approach matters for anyone looking to use synthetic clinical text to expand datasets without compromising essential medical content.

Core claim

Through experiments at the million-note scale on data from the MIMIC databases, the work demonstrates that synthetic notes created by rephrasing with LLMs preserve core clinical information and the ability to make useful predictions on coarse-grained tasks, despite large shifts in language. For finer tasks such as ICD coding, however, important details are lost. Rephrasing the notes in chunks rather than as complete documents substantially reduces this loss of detail, but this comes at the expense of lower factual precision when the rephrasing model works with incomplete surrounding context. Synthesis errors turn out to be driven mainly by misinterpretation of clinical context, together with temporal confusion, measurement errors, and fabricated claims.

What carries the argument

Multi-aspect evaluation at million-note scale that combines intrinsic similarity, extrinsic task utility, and factuality assessment on LLM-rephrased notes from clinical databases.
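
As a rough illustration of the intrinsic-similarity leg of such an evaluation, the sketch below computes unigram Jaccard overlap between an original note and its rephrasing. The metric choice and the names `tokenize` and `jaccard_similarity` are assumptions made for illustration; this page does not state which similarity measures the paper actually uses.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(original: str, rephrased: str) -> float:
    """Share of distinct tokens that the two texts have in common."""
    a, b = tokenize(original), tokenize(rephrased)
    if not a and not b:
        # Two empty texts are treated as identical (a convention, not a rule).
        return 1.0
    return len(a & b) / len(a | b)


original = "Patient has fever"
rephrased = "The patient has a fever"
print(round(jaccard_similarity(original, rephrased), 2))  # 0.6
```

A low score here only signals that the wording changed; it says nothing about whether clinical content survived, which is why the extrinsic and factuality checks are run alongside it.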

If this is right

Synthetic notes can augment training sets for rare ICD codes without being tailored to that task.
Chunk-based rephrasing recovers more fine-grained clinical details than full-note rephrasing.
Full context during rephrasing improves factual precision compared to incomplete context.
Most errors arise from context misinterpretation rather than other fabrication types.
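
The difference between whole-note and chunk-based rephrasing can be sketched as follows. This is a minimal illustration under stated assumptions: the `rephrase` callable stands in for an LLM call, chunks are cut by word count, and `toy_rephrase` is a placeholder. None of these details come from the paper, which does not specify its chunking rule on this page.

```python
from typing import Callable, List


def split_into_chunks(note: str, max_words: int) -> List[str]:
    """Split a note into consecutive chunks of at most max_words words.

    Splitting on whitespace discards the original line breaks; a real
    pipeline would more likely cut on section or sentence boundaries.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = note.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def rephrase_whole(note: str, rephrase: Callable[[str], str]) -> str:
    """One call: the model sees the full note as context."""
    return rephrase(note)


def rephrase_by_chunks(note: str, rephrase: Callable[[str], str], max_words: int) -> str:
    """One call per chunk: each call sees only its own chunk."""
    return "\n".join(rephrase(chunk) for chunk in split_into_chunks(note, max_words))


def toy_rephrase(text: str) -> str:
    """Hypothetical stand-in for an LLM rephrasing call."""
    return text.upper()


note = "pt admitted with chest pain troponin negative discharged home"
print(rephrase_whole(note, toy_rephrase))
print(rephrase_by_chunks(note, toy_rephrase, max_words=4))
```

Because each chunk is rephrased in isolation, a chunk that mentions a lab value no longer carries the admission context that explains it, which is one plausible route to the reduced factual precision described above.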

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These results suggest testing hybrid rephrasing strategies that mix chunk and full-note approaches for different note sections.
Similar evaluations could apply to other medical text types such as discharge summaries or radiology reports.
The approach opens possibilities for generating privacy-preserving synthetic datasets for broader medical AI development.

Load-bearing premise

The chosen intrinsic, extrinsic, and factuality metrics together give a complete enough picture of synthetic note quality for all downstream clinical applications.

What would settle it

A direct comparison showing that chunk-rephrased notes do not improve ICD coding performance over whole-note rephrased notes on a large test set would falsify the mitigation benefit of chunking.
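
A comparison of this kind could be scored with micro-averaged F1 over predicted ICD code sets, as in the sketch below. Micro-F1 is an assumed metric choice, and the example codes and predictions are invented purely to show the computation; they are not results from the paper.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 for multi-label predictions, pooled over all notes."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of notes")
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # codes correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented toy data: gold codes for two notes, and predictions from models
# trained on whole-note versus chunk-rephrased text.
gold = [{"I10", "E11.9"}, {"J18.9"}]
pred_whole = [{"I10"}, {"J18.9", "N17.9"}]
pred_chunk = [{"I10", "E11.9"}, {"J18.9"}]

print(round(micro_f1(gold, pred_whole), 3))  # 0.667
print(micro_f1(gold, pred_chunk))            # 1.0
```

If the chunk-based score did not exceed the whole-note score on a large held-out set, the mitigation claim would not hold for that task.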

Figures

Figures reproduced from arXiv: 2605.17775 by Anthony Nguyen, Jinghui Liu, Sarvesh Soni.

**Figure 2.** Comparing synthetic notes with human-written notes in downstream modeling. Each metric is calculated ...

**Figure 3.** Fairness evaluation of the synthetic notes in downstream modeling in comparison with human-written ...

**Figure 5.** Impact of repeated synthesis on 30-day hospi...

**Figure 4.** Impact of varied prompt instructions on intrin...

Original abstract

Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect -- such as similarity or utility comparisons -- even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for tasks like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes -- despite their task-agnostic nature -- can effectively augment task-specific training for rare ICD codes.
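
Factual precision, as used in the abstract, can loosely be read as the share of claims in a synthetic note that the source note supports. The sketch below shows that ratio under strong assumptions: claims are already extracted, and `toy_is_supported` is a hypothetical substring check standing in for whatever fact-checking procedure the paper uses, which this page does not describe.

```python
from typing import Callable, Iterable


def factual_precision(
    claims: Iterable[str],
    source_note: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of claims from the synthetic note that the source supports."""
    claim_list = list(claims)
    if not claim_list:
        raise ValueError("at least one claim is required")
    supported = sum(1 for claim in claim_list if is_supported(claim, source_note))
    return supported / len(claim_list)


def toy_is_supported(claim: str, source_note: str) -> bool:
    """Hypothetical verifier: case-insensitive substring match only."""
    return claim.lower() in source_note.lower()


source = "Patient admitted with pneumonia. Creatinine 1.2 mg/dL on day 2."
claims = [
    "admitted with pneumonia",
    "creatinine 2.1 mg/dL",           # measurement error: digits swapped
    "creatinine 1.2 mg/dL on day 2",
]
print(round(factual_precision(claims, source, toy_is_supported), 3))  # 0.667
```

A substring check would miss legitimate paraphrases, so a realistic verifier would need an entailment model or human review in place of `toy_is_supported`.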

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Chunk rephrasing helps recover fine details for ICD coding in LLM synthetic notes but trades off factuality, and the benefit looks tied to that one task.

The main thing to know is that this million-note evaluation finds that LLM-rephrased clinical notes hold onto the main clinical facts and work for coarse predictions, but drop specifics that matter for ICD coding. Rephrasing by chunks instead of whole notes recovers much of that lost detail, though it produces more factual slips when the model sees incomplete context.

The paper runs intrinsic checks on text similarity, extrinsic tests of model performance on clinical prediction tasks, and factuality reviews against the original notes. The error analysis identifies context misinterpretation as the dominant issue, alongside temporal mix-ups, measurement mistakes, and fabricated claims. The authors also show that the synthetic notes can augment training data to improve models for rare ICD codes. What is new is the combination of all three evaluation types at this scale with the chunking strategy; the paper builds on earlier work but adds the mitigation angle and the augmentation result.

The open question is whether the chunking benefit is broad or mostly specific to the ICD task that was tested. Other detail-heavy tasks, such as medication reconciliation or ordering events in time, might not benefit and could suffer from the split context, especially since temporal confusion is already a noted error type. The abstract gives no exact numbers or statistical details, so the strength of the claims depends on the full results and methods.

This paper is for researchers working on synthetic clinical text for privacy or augmentation in medical NLP. A reader interested in practical ways to generate usable synthetic data will get value from the scale and the error breakdown. It deserves serious referee time because the large scale and direct comparisons make the findings worth verifying, even if revisions are needed on the generalizability of the chunk approach.

Referee Report

2 major / 2 minor

Summary. The paper conducts a large-scale (million-note) evaluation of LLM-rephrased synthetic clinical notes derived from MIMIC data, combining intrinsic similarity metrics, extrinsic predictive utility tests (including ICD coding), and factuality/error analysis. It claims that synthetic notes retain core clinical information and utility for coarse-grained tasks despite linguistic divergence, but lose fine-grained details; chunk-wise rephrasing substantially mitigates the detail loss for tasks such as ICD coding, albeit at the expense of reduced factual precision under incomplete context. Synthesis errors are dominated by context misinterpretation, temporal confusion, measurement errors, and fabricated claims, and the notes can augment training data for rare ICD codes.

Significance. If the central claims hold, the work supplies a practical, multi-faceted evaluation framework for synthetic clinical text at unprecedented scale and demonstrates a concrete mitigation strategy (chunk rephrasing) that could improve utility for data augmentation in clinical NLP, particularly for rare-event modeling. The million-note scale and complementary intrinsic/extrinsic/factuality lens are genuine strengths that go beyond typical narrow similarity or utility studies.

major comments (2)

[Abstract / Extrinsic evaluation] Abstract and extrinsic-evaluation section: the headline claim that chunk rephrasing 'substantially mitigates' loss of fine-grained detail rests on ICD-coding accuracy as the sole reported extrinsic proxy for detail preservation. Given that the paper's own error analysis identifies temporal confusion and context misinterpretation as dominant failure modes—precisely the phenomena that incomplete chunk context could exacerbate—the mitigation benefit may be task-specific rather than general. Additional fine-grained tasks (medication reconciliation, procedure sequencing, temporal ordering) are needed to support the broader claim.
[Methods] Methods / chunk-rephrasing description: chunk size is listed as a free parameter with no sensitivity analysis or pre-specified justification provided. Because the mitigation result for ICD coding is shown only for the chosen chunking regime, it is unclear whether the reported improvement is robust or post-hoc tuned; a load-bearing sensitivity table or ablation across chunk sizes would be required to substantiate the strategy.

minor comments (2)

[Abstract] The abstract states that the three evaluation families are 'complementary and best viewed in parallel,' yet no quantitative integration or joint statistical test across intrinsic, extrinsic, and factuality scores is reported; readers cannot easily judge overall quality trade-offs.
[Results] Error bars, confidence intervals, or statistical significance tests are absent from the reported metric comparisons; given the million-note scale, even modest effect sizes should be accompanied by uncertainty estimates.
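
One common way to attach the uncertainty estimates requested here is a paired percentile bootstrap over per-note scores. The sketch below is a generic illustration, not the paper's procedure; the function name, defaults, and example scores are all assumptions.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean per-note difference (a minus b)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample notes with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = max(int((1 - alpha / 2) * n_resamples) - 1, lo_idx)
    return means[lo_idx], means[hi_idx]


# Invented per-note scores for two rephrasing strategies.
chunk_scores = [0.8, 0.7, 0.9, 0.6, 0.75]
whole_scores = [0.7, 0.7, 0.8, 0.5, 0.70]
low, high = paired_bootstrap_ci(chunk_scores, whole_scores)
print(low, high)
```

An interval that excludes zero suggests the difference is unlikely to be resampling noise; at million-note scale the interval will be narrow, so the effect size matters more than significance alone.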

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We appreciate the recognition of the scale and multi-faceted nature of our evaluation. Below, we provide point-by-point responses to the major comments and indicate the revisions we plan to make.

Referee: [Abstract / Extrinsic evaluation] Abstract and extrinsic-evaluation section: the headline claim that chunk rephrasing 'substantially mitigates' loss of fine-grained detail rests on ICD-coding accuracy as the sole reported extrinsic proxy for detail preservation. Given that the paper's own error analysis identifies temporal confusion and context misinterpretation as dominant failure modes—precisely the phenomena that incomplete chunk context could exacerbate—the mitigation benefit may be task-specific rather than general. Additional fine-grained tasks (medication reconciliation, procedure sequencing, temporal ordering) are needed to support the broader claim.

Authors: We acknowledge that our extrinsic evaluation for fine-grained detail preservation relies primarily on ICD coding accuracy as a proxy. ICD coding is a clinically relevant task that requires precise extraction of specific diagnostic information from the notes, making it a strong indicator of detail retention. However, we agree that demonstrating the mitigation effect on additional tasks would strengthen the claim. In the revised manuscript, we will expand the discussion to address why ICD coding serves as a representative task for fine-grained clinical details and include a limitations section noting that the benefits may vary across tasks. If space and resources permit, we will consider adding preliminary results for one additional task such as medication reconciliation.
Revision: partial
Referee: [Methods] Methods / chunk-rephrasing description: chunk size is listed as a free parameter with no sensitivity analysis or pre-specified justification provided. Because the mitigation result for ICD coding is shown only for the chosen chunking regime, it is unclear whether the reported improvement is robust or post-hoc tuned; a load-bearing sensitivity table or ablation across chunk sizes would be required to substantiate the strategy.

Authors: We agree that a sensitivity analysis on chunk size would improve the robustness of our findings. The chunk size was chosen based on preliminary experiments to balance context completeness with computational efficiency, but we did not include the full ablation in the original submission. In the revision, we will add a sensitivity analysis table showing ICD coding performance across a range of chunk sizes (e.g., 500, 1000, 2000 tokens) to demonstrate that the improvement is consistent and not tuned to a specific value.
Revision: yes

Circularity Check

0 steps flagged

No circularity in empirical evaluation of synthetic clinical notes

The paper performs direct empirical comparisons of LLM-rephrased notes against original MIMIC data using intrinsic similarity metrics, extrinsic task performance (e.g., ICD coding), and factuality checks at scale. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the reported results. Central claims rest on observable differences in linguistic changes, detail preservation, and error modes rather than any quantity defined in terms of itself or reduced by construction to the input data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the representativeness of MIMIC notes as a source for rephrasing and the adequacy of the three evaluation categories to capture quality. No new physical entities or theoretical constructs are introduced. Experimental choices such as chunk size act as free parameters but are not numerically fitted in the abstract.

free parameters (1)

chunk size for rephrasing
Experimental design choice to balance detail preservation against context completeness; value not numerically specified in abstract.

axioms (1)

domain assumption: MIMIC clinical notes are representative of real-world clinical documentation for evaluation purposes
Used as the source database for generating and evaluating synthetic notes at scale.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: systematic evaluation ... intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: loss of fine-grained details ... mitigated by rephrasing notes by chunks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 6 internal anchors

[1]

Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation

Carandang, Kristine Ann M and Arana, Jasper Meynard and Casin, Ethan Robert and Monterola, Christopher and Tan, Daniel Stanley and Valenzuela, Jesus Felix B and Alis, Christian. Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation. Proceedings of the 63rd Annual Meeting of the Association for Computatio...

work page
[2]

FactEHR : A Dataset for Evaluating Factuality in Clinical Notes Using LLMs

Munnangi, Monica and Swaminathan, Akshay and Fries, Jason Alan and Jindal, Jenelle A and Narayanan, Sanjana and Lopez, Ivan and Tu, Lucia and Chung, Philip and Omiye, Jesutofunmi and Kashyap, Mehr and Shah, Nigam. FactEHR : A Dataset for Evaluating Factuality in Clinical Notes Using LLMs. Machine Learning for Healthcare Conference

work page
[3]

Causal representation learning from multimodal clinical records under non-random modality missingness

Liang, Zihan and Pan, Ziwen and Xiong, Ruoxuan. Causal representation learning from multimodal clinical records under non-random modality missingness. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

work page 2025
[4]

Factuality of large language models: A survey

Wang, Yuxia and Wang, Minghan and Manzoor, Muhammad Arslan and Liu, Fei and Georgiev, Georgi Nenkov and Das, Rocktim Jyoti and Nakov, Preslav. Factuality of large language models: A survey. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

work page 2024
[5]

Generation of synthetic clinical text: A systematic review

Alshaikhdeeb, Basel and Hemedan, Ahmed Abdelmonem and Ghosh, Soumyabrata and Balaur, Irina and Satagopam, Venkata. Generation of synthetic clinical text: A systematic review. arXiv [cs.CL]. arXiv:2507.18451

work page arXiv
[6]

A review on generative AI models for synthetic medical text, time series, and longitudinal data

Loni, Mohammad and Poursalim, Fatemeh and Asadi, Mehdi and Gharehbaghi, Arash. A review on generative AI models for synthetic medical text, time series, and longitudinal data. npj digital medicine

work page
[7]

FActScore : Fine-grained atomic evaluation of factual precision in long form text generation

Min, Sewon and Krishna, Kalpesh and Lyu, Xinxi and Lewis, Mike and Yih, Wen-Tau and Koh, Pang and Iyyer, Mohit and Zettlemoyer, Luke and Hajishirzi, Hannaneh. FActScore : Fine-grained atomic evaluation of factual precision in long form text generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

work page 2023
[8]

Synth- SBDH : A synthetic dataset of social and behavioral determinants of health for clinical text

Mitra, Avijit and Yang, Zhichao and Druhl, Emily and Goodwin, Raelene and Yu, Hong. Synth- SBDH : A synthetic dataset of social and behavioral determinants of health for clinical text. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

work page 2025
[9]

An evaluation of synthetic data augmentation for mitigating covariate bias in health data

Juwara, Lamin and El-Hussuna, Alaa and El Emam, Khaled. An evaluation of synthetic data augmentation for mitigating covariate bias in health data. Patterns (New York, N.Y.)

work page
[10]

Demystifying synthetic data in LLM pre-training: A systematic study of scaling laws, benefits, and pitfalls

Kang, Feiyang and Ardalani, Newsha and Kuchnik, Michael and Emad, Youssef and Elhoushi, Mostafa and Sengupta, Shubhabrata and Li, Shang-Wen and Raghavendra, Ramya and Jia, Ruoxi and Wu, Carole-Jean. Demystifying synthetic data in LLM pre-training: A systematic study of scaling laws, benefits, and pitfalls. Proceedings of the 2025 Conference on Empirical M...

work page 2025
[11]

Generative models improve fairness of medical classifiers under distribution shifts

Ktena, Ira and Wiles, Olivia and Albuquerque, Isabela and Rebuffi, Sylvestre-Alvise and Tanno, Ryutaro and Roy, Abhijit Guha and Azizi, Shekoofeh and Belgrave, Danielle and Kohli, Pushmeet and Cemgil, Taylan and Karthikesalingam, Alan and Gowal, Sven. Generative models improve fairness of medical classifiers under distribution shifts. Nature Medicine

work page
[12]

NoteChat : A dataset of synthetic patient-physician conversations conditioned on clinical notes

Wang, Junda and Yao, Zonghai and Yang, Zhichao and Zhou, Huixue and Li, Rumeng and Wang, Xun and Xu, Yucheng and Yu, Hong. NoteChat : A dataset of synthetic patient-physician conversations conditioned on clinical notes. Findings of the Association for Computational Linguistics ACL 2024

work page 2024
[13]

BeyondWeb : Lessons from scaling synthetic data for trillion-scale pretraining

Maini, Pratyush and Dorna, Vineeth and Doshi, Parth and Carranza, Aldo and Pan, Fan and Urbanek, Jack and Burstein, Paul and Fang, Alex and Deng, Alvin and Abbas, Amro and Larsen, Brett and Blakeney, Cody and Bannur, Charvi and Baek, Christina and Teh, Darren and Schwab, David and Mongstad, Haakon and Yin, Haoli and Wills, Josh and Mentzer, Kaleigh and Me...

work page arXiv
[14]

Verifying facts in patient care documents generated by large language models using electronic health records

Chung, Philip and Swaminathan, Akshay and Goodell, Alex J and Kim, Yeasul and Momsen Reincke, S and Han, Lichy and Deverett, Ben and Sadeghi, Mohammad Amin and Ariss, Abdel-Badih and Ghanem, Marc and Seong, David and Lee, Andrew A and Coombes, Caitlin E and Bradshaw, Brad and Sufian, Mahir A and Hong, Hyo Jung and Nguyen, Teresa P and Rasouli, Mohammad R ...

work page
[15]

Evaluation of electronic health record-integrated artificial intelligence chart review

Kahl, Nicolas M and Frieden, Marshall J and Pope, Zach R and Millen, Marlene M and Tolia, Vaishal M and Chan, Theodore C and Longhurst, Christopher A and Singh, Karandeep and You, Alan X. Evaluation of electronic health record-integrated artificial intelligence chart review. Npj Health Systems

work page
[16]

Assisting in writing Wikipedia-like articles from scratch with large language models

Shao, Yijia and Jiang, Yucheng and Kanell, Theodore and Xu, Peter and Khattab, Omar and Lam, Monica. Assisting in writing Wikipedia-like articles from scratch with large language models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

work page 2024
[17]

Toward relieving clinician burden by automatically generating progress notes using interim hospital data

Soni, Sarvesh and Demner-Fushman, Dina. Toward relieving clinician burden by automatically generating progress notes using interim hospital data. AMIA Annual Symposium Proceedings

work page
[18]

Health system-scale language models are all-purpose prediction engines

Jiang, Lavender Yao and Liu, Xujin Chris and Nejatian, Nima Pour and Nasir-Moin, Mustafa and Wang, Duo and Abidin, Anas and Eaton, Kevin and Riina, Howard Antony and Laufer, Ilya and Punjabi, Paawan and Miceli, Madeline and Kim, Nora C and Orillac, Cordelia and Schnurman, Zane and Livia, Christopher and Weiss, Hannah and Kurland, David and Neifert, Sean a...

work page
[19]

Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art

Lewis, Patrick and Ott, Myle and Du, Jingfei and Stoyanov, Veselin. Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art. Proceedings of the 3rd Clinical Natural Language Processing Workshop

work page
[20]

Automated Medical Coding on MIMIC - III and MIMIC - IV : A Critical Review and Replicability Study

Edin, Joakim and Junge, Alexander and Havtorn, Jakob D and Borgholt, Lasse and Maistro, Maria and Ruotsalo, Tuukka and Maaløe, Lars. Automated Medical Coding on MIMIC - III and MIMIC - IV : A Critical Review and Replicability Study. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

work page
[21]

Adapted large language models can outperform medical experts in clinical text summarization

Van Veen, Dave and Van Uden, Cara and Blankemeier, Louis and Delbrouck, Jean-Benoit and Aali, Asad and Bluethgen, Christian and Pareek, Anuj and Polacin, Malgorzata and Reis, Eduardo Pontes and Seehofnerová, Anna and Rohatgi, Nidhi and Hosamani, Poonam and Collins, William and Ahuja, Neera and Langlotz, Curtis P and Hom, Jason and Gatidis, Sergios and Pau...

work page
[22]

Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation

Yim, Wen-Wai and Fu, Yujuan and Ben Abacha, Asma and Snider, Neal and Lin, Thomas and Yetisgen, Meliha. Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Scientific data

work page
[23]

The Llama 3 Herd of Models

Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Yang, Amy and Fan, Angela and Goyal, Anirudh and Hartshorn, Anthony and Yang, Aobo and Mitra, Archi and Sravankumar, Archie and Korenev, Artem and Hinsvark, Arthur and Rao, Arun and Zhang, Aston and ...

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Large language models are less effective at clinical prediction tasks than locally trained machine learning models

Brown, Katherine E and Yan, Chao and Li, Zhuohang and Zhang, Xinmeng and Collins, Benjamin X and Chen, You and Clayton, Ellen Wright and Kantarcioglu, Murat and Vorobeychik, Yevgeniy and Malin, Bradley A. Large language models are less effective at clinical prediction tasks than locally trained machine learning models. Journal of the American Medical Info...

work page
[25]

Using synthetic health care data to leverage large language models for named entity recognition: Development and validation study

Šuvalov, Hendrik and Lepson, Mihkel and Kukk, Veronika and Malk, Maria and Ilves, Neeme and Kuulmets, Hele-Andra and Kolde, Raivo. Using synthetic health care data to leverage large language models for named entity recognition: Development and validation study. Journal of medical internet research

work page
[26]

Efficient memory management for large language model serving with PagedAttention

Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph and Zhang, Hao and Stoica, Ion. Efficient memory management for large language model serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles

work page
[27]

Evaluating hospital course summarization by an electronic health record-based large language model

Small, William R and Austrian, Jonathan and O'Donnell, Luke and Burk-Rafel, Jesse and Hochman, Katherine A and Goodman, Adam and Zaretsky, Jonah and Martin, Jacob and Johnson, Stephen and Major, Vincent J and Jones, Simon and Henke, Christian and Verplanke, Benjamin and Osso, Jwan and Larson, Ian and Saxena, Archana and Mednick, Aron and Simonis, Choumika...

work page
[28]

DeBERTaV3 : Improving DeBERTa using ELECTRA -Style Pre-Training with Gradient-Disentangled Embedding Sharing

He, Pengcheng and Gao, Jianfeng and Chen, Weizhu. DeBERTaV3 : Improving DeBERTa using ELECTRA -Style Pre-Training with Gradient-Disentangled Embedding Sharing. The Eleventh International Conference on Learning Representations

work page
[29]

Generating synthetic clinical text with local large language models to identify misdiagnosed limb fractures in radiology reports

Liu, Jinghui and Koopman, Bevan and Brown, Nathan J and Chu, Kevin and Nguyen, Anthony. Generating synthetic clinical text with local large language models to identify misdiagnosed limb fractures in radiology reports. Artificial intelligence in medicine

work page
[30]

SMOG grading - A new readability formula

Harry, G and Laughlin, Mc. SMOG grading - A new readability formula. The Journal of Reading

work page
[31]

Large language models can support generation of standardized discharge summaries - A retrospective study utilizing ChatGPT -4 and electronic health records

Schwieger, Arne and Angst, Katrin and de Bardeci, Mateo and Burrer, Achim and Cathomas, Flurin and Ferrea, Stefano and Grätz, Franziska and Knorr, Marius and Kronenberg, Golo and Spiller, Tobias and Troi, David and Seifritz, Erich and Weber, Samantha and Olbrich, Sebastian. Large language models can support generation of standardized discharge summaries -...

work page
[32]

Equality of opportunity in supervised learning

Hardt, Moritz and Price, Eric and Srebro, Nathan. Equality of opportunity in supervised learning. Proceedings of the 30th International Conference on Neural Information Processing Systems

work page
[33]

DataComp - LM : In search of the next generation of training sets for language models

Li, Jeffrey and Fang, Alex and Smyrnis, Georgios and Ivgi, Maor and Jordan, Matt and Gadre, Samir Yitzhak and Bansal, Hritik and Guha, Etash Kumar and Keh, Sedrick and Arora, Kushal and Garg, Saurabh and Xin, Rui and Muennighoff, Niklas and Heckel, Reinhard and Mercat, Jean and Chen, Mayee F and Gururangan, Suchin and Wortsman, Mitchell and Albalak, Alon ...

work page
[34]

Rephrasing Electronic Health Records for Pretraining Clinical Language Models

Liu, Jinghui and Nguyen, Anthony. Rephrasing Electronic Health Records for Pretraining Clinical Language Models. Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association

work page
[35]

Evaluating text complexity and Flesch-Kincaid grade level

Solnyshkina, M and Zamaletdinov, R and Gorodetskaya, L and Gabitov, A. Evaluating text complexity and Flesch-Kincaid grade level. Journal of social studies education research

work page
[36]

MIMIC- III , a freely accessible critical care database

Johnson, Alistair E W and Pollard, Tom J and Shen, Lu and Lehman, Li-Wei H and Feng, Mengling and Ghassemi, Mohammad and Moody, Benjamin and Szolovits, Peter and Celi, Leo Anthony and Mark, Roger G. MIMIC- III , a freely accessible critical care database. Scientific data

work page
[37]

Textbooks Are All You Need

Gunasekar, Suriya and Zhang, Yi and Aneja, Jyoti and Mendes, Caio César Teodoro and Del Giorno, Allie and Gopi, Sivakanth and Javaheripi, Mojan and Kauffmann, Piero and de Rosa, Gustavo and Saarikivi, Olli and Salim, Adil and Shah, Shital and Behl, Harkirat Singh and Wang, Xin and Bubeck, Sébastien and Eldan, Ronen and Kalai, Adam Tauman and Lee, Yin Tat ...

work page internal anchor Pith review Pith/arXiv arXiv
[38]

A Survey of Evaluation Metrics Used for NLG Systems

Sai, Ananya B and Mohankumar, Akash Kumar and Khapra, Mitesh M. A Survey of Evaluation Metrics Used for NLG Systems. ACM Comput. Surv.

[39]

Two Directions for Clinical Data Generation with Large Language Models: Data-to-Label and Label-to-Data

Li, Rumeng and Wang, Xun and Yu, Hong. Two Directions for Clinical Data Generation with Large Language Models: Data-to-Label and Label-to-Data. Findings of the Association for Computational Linguistics: EMNLP 2023

[40]

ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission

Huang, Kexin and Altosaar, Jaan and Ranganath, Rajesh. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv [cs.CL]. arXiv:1904.05342

[41]

Hierarchical label-wise attention transformer model for explainable ICD coding

Liu, Leibo and Perez-Concha, Oscar and Nguyen, Anthony and Bennett, Vicki and Jorm, Louisa. Hierarchical label-wise attention transformer model for explainable ICD coding. Journal of biomedical informatics

[42]

Deep Patient Representation of Clinical Notes via Multi-Task Learning for Mortality Prediction

Si, Yuqi and Roberts, Kirk. Deep Patient Representation of Clinical Notes via Multi-Task Learning for Mortality Prediction. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science

[43]

MIMIC-IV, a freely accessible electronic health record dataset

Johnson, Alistair E W and Bulgarelli, Lucas and Shen, Lu and Gayles, Alvin and Shammout, Ayad and Horng, Steven and Pollard, Tom J and Moody, Benjamin and Gow, Brian and Lehman, Li-Wei H and Celi, Leo A and Mark, Roger G. MIMIC-IV, a freely accessible electronic health record dataset. Scientific data

[44]

Unintended Consequences of Nationwide Electronic Health Record Adoption: Challenges and Opportunities in the Post-Meaningful Use Era

Colicchio, Tiago K and Cimino, James J and Del Fiol, Guilherme. Unintended Consequences of Nationwide Electronic Health Record Adoption: Challenges and Opportunities in the Post-Meaningful Use Era. Journal of medical Internet research

[45]

Evaluating progress in automatic chest X-ray radiology report generation

Yu, Feiyang and Endo, Mark and Krishnan, Rayan and Pan, Ian and Tsai, Andy and Reis, Eduardo Pontes and Fonseca, Eduardo Kaiser Ururahy and Lee, Henrique Min Ho and Abad, Zahra Shakeri Hossein and Ng, Andrew Y and Langlotz, Curtis P and Venugopal, Vasantha Kumar and Rajpurkar, Pranav. Evaluating progress in automatic chest X-ray radiology report generati...

[46]

“Note Bloat” impacts deep learning-based NLP models for clinical prediction tasks

Liu, Jinghui and Capurro, Daniel and Nguyen, Anthony and Verspoor, Karin. “Note Bloat” impacts deep learning-based NLP models for clinical prediction tasks. Journal of biomedical informatics

[47]

DRG-LLaMA: tuning LLaMA model to predict diagnosis-related group for hospitalized patients

Wang, Hanyin and Gao, Chufan and Dantona, Christopher and Hull, Bryan and Sun, Jimeng. DRG-LLaMA: tuning LLaMA model to predict diagnosis-related group for hospitalized patients. NPJ digital medicine. arXiv:2309.12625

[48]

PLM-ICD: Automatic ICD Coding with Pretrained Language Models

Huang, Chao-Wei and Tsai, Shang-Chi and Chen, Yun-Nung. PLM-ICD: Automatic ICD Coding with Pretrained Language Models. Proceedings of the 4th Clinical Natural Language Processing Workshop

[49]

ROUGE: A Package for Automatic Evaluation of Summaries

Lin, Chin-Yew. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out

[50]

Mistral 7B

Jiang, Albert Q and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and de las Casas, Diego and Bressand, Florian and Lengyel, Gianna and Lample, Guillaume and Saulnier, Lucile and Lavaud, Lélio Renard and Lachaux, Marie-Anne and Stock, Pierre and Le Scao, Teven and Lavril, Thibaut and Wang, Thomas and Lacroix, Ti...

[51]

Does Synthetic Data Generation of LLMs Help Clinical Text Mining?

Tang, Ruixiang and Han, Xiaotian and Jiang, Xiaoqian and Hu, Xia. Does Synthetic Data Generation of LLMs Help Clinical Text Mining? arXiv [cs.CL]. arXiv:2303.04360

[52]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Penedo, Guilherme and Kydlíček, Hynek and Allal, Loubna Ben and Lozhkov, Anton and Mitchell, Margaret and Raffel, Colin and Von Werra, Leandro and Wolf, Thomas. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv [cs.CL]. arXiv:2406.17557

[53]

BERTScore: Evaluating Text Generation with BERT

Zhang, Tianyi and Kishore, Varsha and Wu, Felix and Weinberger, Kilian Q and Artzi, Yoav. BERTScore: Evaluating Text Generation with BERT. International Conference on Learning Representations

[54]

Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes

Kweon, Sunjun and Kim, Junu and Kim, Jiyoun and Im, Sujeong and Cho, Eunbyeol and Bae, Seongsu and Oh, Jungwoo and Lee, Gyubok and Moon, Jong Hak and You, Seng Chan and Baek, Seungjin and Han, Chang Hoon and Jung, Yoon Bin and Jo, Yohan and Choi, Edward. Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes. Findings of the As...

[55]

Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

Maini, Pratyush and Seto, Skyler and Bai, Richard and Grangier, David and Zhang, Yizhe and Jaitly, Navdeep. Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

[56]

Discharge Me!

Xu, Justin and Chen, Zhihong and Johnston, Andrew and Blankemeier, Louis and Varma, Maya and Hom, Jason and Collins, William J and Modi, Ankit and Lloyd, Robert and Hopkins, Benjamin and Langlotz, Curtis and Delbrouck, Jean-Benoit. Overview of the First Shared Task on Clinical Text Generation: RRG24 and “Discharge Me!”. Proceedings of the 23rd Workshop on...

[57]

Hard for humans, hard for machines: predicting readmission after psychiatric hospitalization using narrative notes

Boag, William and Kovaleva, Olga and McCoy, Jr, Thomas H and Rumshisky, Anna and Szolovits, Peter and Perlis, Roy H. Hard for humans, hard for machines: predicting readmission after psychiatric hospitalization using narrative notes. Translational psychiatry

[58]

e-Health CSIRO at RRG24: Entropy-Augmented Self-Critical Sequence Training for Radiology Report Generation

Nicolson, Aaron and Liu, Jinghui and Dowling, Jason and Nguyen, Anthony and Koopman, Bevan. e-Health CSIRO at RRG24: Entropy-Augmented Self-Critical Sequence Training for Radiology Report Generation. Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

[59]

Qwen2.5 Technical Report

Qwen and Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Ke...

[60]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Hsieh, Cheng-Ping and Sun, Simeng and Kriman, Samuel and Acharya, Shantanu and Rekesh, Dima and Jia, Fei and Ginsburg, Boris. RULER: What's the Real Context Size of Your Long-Context Language Models? First Conference on Language Modeling

[61]

Lost in the middle: How language models use long contexts

Liu, Nelson F and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics

[62]

A unified review of deep learning for automated medical coding

Ji, Shaoxiong and Li, Xiaobo and Sun, Wei and Dong, Hang and Taalas, Ara and Zhang, Yijia and Wu, Honghan and Pitkänen, Esa and Marttinen, Pekka. A unified review of deep learning for automated medical coding. ACM computing surveys

[63]

Current and future state of evaluation of large language models for medical summarization tasks

Croxford, Emma and Gao, Yanjun and Pellegrino, Nicholas and Wong, Karen and Wills, Graham and First, Elliot and Liao, Frank and Goswami, Cherodeep and Patterson, Brian and Afshar, Majid. Current and future state of evaluation of large language models for medical summarization tasks. npj Health Systems

[64]

Summarizing Patients’ Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models

Gao, Yanjun and Dligach, Dmitriy and Miller, Timothy and Xu, Dongfang and Churpek, Matthew M M and Afshar, Majid. Summarizing Patients’ Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models. Proceedings of the 29th International Conference on Computational Linguistics

[65]

Paraphrasing to improve the performance of Electronic Health Records Question Answering

Soni, Sarvesh and Roberts, Kirk. Paraphrasing to improve the performance of Electronic Health Records Question Answering. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science

[66]

AI models collapse when trained on recursively generated data

Shumailov, Ilia and Shumaylov, Zakhar and Zhao, Yiren and Papernot, Nicolas and Anderson, Ross and Gal, Yarin. AI models collapse when trained on recursively generated data. Nature

[67]

QuickUMLS: a fast, unsupervised approach for medical concept extraction

Soldaini, Luca. QuickUMLS: a fast, unsupervised approach for medical concept extraction. MedIR Workshop, SIGIR

[68]

Large language models for reducing clinicians' documentation burden

Roberts, Kirk. Large language models for reducing clinicians' documentation burden. Nature medicine

[69]

A survey of bias in machine learning through the prism of statistical parity

Besse, Philippe and del Barrio, Eustasio and Gordaliza, Paula and Loubes, Jean-Michel and Risser, Laurent. A survey of bias in machine learning through the prism of statistical parity. The American statistician

[70]

Aligning AI research with the needs of clinical coding workflows: Eight recommendations based on US data analysis and critical review

Gan, Yidong and Rybinski, Maciej and Hachey, Ben and Kummerfeld, Jonathan K. Aligning AI research with the needs of clinical coding workflows: Eight recommendations based on US data analysis and critical review. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

[71]

DocLens: Multi-aspect fine-grained medical text evaluation

Xie, Yiqing and Zhang, Sheng and Cheng, Hao and Liu, Pengfei and Gero, Zelalem and Wong, Cliff and Naumann, Tristan and Poon, Hoifung and Rose, Carolyn. DocLens: Multi-aspect fine-grained medical text evaluation. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

[72]

Less is more: Explainable and efficient ICD code prediction with clinical entities

Douglas, James C and Gan, Yidong and Hachey, Ben and Kummerfeld, Jonathan K. Less is more: Explainable and efficient ICD code prediction with clinical entities. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)



[32] [32]

Equality of opportunity in supervised learning

Hardt, Moritz and Price, Eric and Srebro, Nathan. Equality of opportunity in supervised learning. Proceedings of the 30th International Conference on Neural Information Processing Systems

work page

[33] [33]

DataComp - LM : In search of the next generation of training sets for language models

Li, Jeffrey and Fang, Alex and Smyrnis, Georgios and Ivgi, Maor and Jordan, Matt and Gadre, Samir Yitzhak and Bansal, Hritik and Guha, Etash Kumar and Keh, Sedrick and Arora, Kushal and Garg, Saurabh and Xin, Rui and Muennighoff, Niklas and Heckel, Reinhard and Mercat, Jean and Chen, Mayee F and Gururangan, Suchin and Wortsman, Mitchell and Albalak, Alon ...

work page

[34] [34]

Rephrasing Electronic Health Records for Pretraining Clinical Language Models

Liu, Jinghui and Nguyen, Anthony. Rephrasing Electronic Health Records for Pretraining Clinical Language Models. Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association

work page

[35] [35]

Evaluating text complexity and Flesch-Kincaid grade level

Solnyshkina, M and Zamaletdinov, R and Gorodetskaya, L and Gabitov, A. Evaluating text complexity and Flesch-Kincaid grade level. Journal of social studies education research

work page

[36] [36]

MIMIC- III , a freely accessible critical care database

Johnson, Alistair E W and Pollard, Tom J and Shen, Lu and Lehman, Li-Wei H and Feng, Mengling and Ghassemi, Mohammad and Moody, Benjamin and Szolovits, Peter and Celi, Leo Anthony and Mark, Roger G. MIMIC- III , a freely accessible critical care database. Scientific data

work page

[37] [37]

Textbooks Are All You Need

Gunasekar, Suriya and Zhang, Yi and Aneja, Jyoti and Mendes, Caio César Teodoro and Del Giorno, Allie and Gopi, Sivakanth and Javaheripi, Mojan and Kauffmann, Piero and de Rosa, Gustavo and Saarikivi, Olli and Salim, Adil and Shah, Shital and Behl, Harkirat Singh and Wang, Xin and Bubeck, Sébastien and Eldan, Ronen and Kalai, Adam Tauman and Lee, Yin Tat ...

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

A Survey of Evaluation Metrics Used for NLG Systems

Sai, Ananya B and Mohankumar, Akash Kumar and Khapra, Mitesh M. A Survey of Evaluation Metrics Used for NLG Systems. ACM Comput. Surv

work page

[39] [39]

Two Directions for Clinical Data Generation with Large Language Models: Data-to-Label and Label-to-Data

Li, Rumeng and Wang, Xun and Yu, Hong. Two Directions for Clinical Data Generation with Large Language Models: Data-to-Label and Label-to-Data. Findings of the Association for Computational Linguistics: EMNLP 2023

work page 2023

[40] [40]

ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission

Huang, Kexin and Altosaar, Jaan and Ranganath, Rajesh. ClinicalBERT : Modeling Clinical Notes and Predicting Hospital Readmission. arXiv [cs.CL]. arXiv:1904.05342

work page internal anchor Pith review Pith/arXiv arXiv 1904

[41] [41]

Hierarchical label-wise attention transformer model for explainable ICD coding

Liu, Leibo and Perez-Concha, Oscar and Nguyen, Anthony and Bennett, Vicki and Jorm, Louisa. Hierarchical label-wise attention transformer model for explainable ICD coding. Journal of biomedical informatics

work page

[42] [42]

Deep Patient Representation of Clinical Notes via Multi-Task Learning for Mortality Prediction

Si, Yuqi and Roberts, Kirk. Deep Patient Representation of Clinical Notes via Multi-Task Learning for Mortality Prediction. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science

work page

[43] [43]

MIMIC- IV , a freely accessible electronic health record dataset

Johnson, Alistair E W and Bulgarelli, Lucas and Shen, Lu and Gayles, Alvin and Shammout, Ayad and Horng, Steven and Pollard, Tom J and Moody, Benjamin and Gow, Brian and Lehman, Li-Wei H and Celi, Leo A and Mark, Roger G. MIMIC- IV , a freely accessible electronic health record dataset. Scientific data

work page

[44] [44]

Unintended Consequences of Nationwide Electronic Health Record Adoption: Challenges and Opportunities in the Post-Meaningful Use Era

Colicchio, Tiago K and Cimino, James J and Del Fiol, Guilherme. Unintended Consequences of Nationwide Electronic Health Record Adoption: Challenges and Opportunities in the Post-Meaningful Use Era. Journal of medical Internet research

work page

[45] [45]

Evaluating progress in automatic chest X -ray radiology report generation

Yu, Feiyang and Endo, Mark and Krishnan, Rayan and Pan, Ian and Tsai, Andy and Reis, Eduardo Pontes and Fonseca, Eduardo Kaiser Ururahy and Lee, Henrique Min Ho and Abad, Zahra Shakeri Hossein and Ng, Andrew Y and Langlotz, Curtis P and Venugopal, Vasantha Kumar and Rajpurkar, Pranav. Evaluating progress in automatic chest X -ray radiology report generati...

work page

[46] [46]

``Note Bloat'' impacts deep learning-based NLP models for clinical prediction tasks

Liu, Jinghui and Capurro, Daniel and Nguyen, Anthony and Verspoor, Karin. ``Note Bloat'' impacts deep learning-based NLP models for clinical prediction tasks. Journal of biomedical informatics

work page

[47] [47]

DRG - LLaMA : tuning LLaMA model to predict diagnosis-related group for hospitalized patients

Wang, Hanyin and Gao, Chufan and Dantona, Christopher and Hull, Bryan and Sun, Jimeng. DRG - LLaMA : tuning LLaMA model to predict diagnosis-related group for hospitalized patients. NPJ digital medicine. arXiv:2309.12625

work page arXiv

[48] [48]

PLM - ICD : Automatic ICD Coding with Pretrained Language Models

Huang, Chao-Wei and Tsai, Shang-Chi and Chen, Yun-Nung. PLM - ICD : Automatic ICD Coding with Pretrained Language Models. Proceedings of the 4th Clinical Natural Language Processing Workshop

work page

[49] [49]

ROUGE : A Package for Automatic Evaluation of Summaries

Lin, Chin-Yew. ROUGE : A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out

work page

[50] [50]

Mistral 7B

Jiang, Albert Q and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and de las Casas, Diego and Bressand, Florian and Lengyel, Gianna and Lample, Guillaume and Saulnier, Lucile and Lavaud, Lélio Renard and Lachaux, Marie-Anne and Stock, Pierre and Le Scao, Teven and Lavril, Thibaut and Wang, Thomas and Lacroix, Ti...

work page internal anchor Pith review Pith/arXiv arXiv

[51] [51]

Does Synthetic Data Generation of LLMs Help Clinical Text Mining?

Tang, Ruixiang and Han, Xiaotian and Jiang, Xiaoqian and Hu, Xia. Does Synthetic Data Generation of LLMs Help Clinical Text Mining?. arXiv [cs.CL]. arXiv:2303.04360

work page arXiv

[52] [52]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Penedo, Guilherme and Kydlíček, Hynek and Allal, Loubna Ben and Lozhkov, Anton and Mitchell, Margaret and Raffel, Colin and Von Werra, Leandro and Wolf, Thomas. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv [cs.CL]. arXiv:2406.17557

work page internal anchor Pith review Pith/arXiv arXiv

[53] [53]

BERTScore : Evaluating Text Generation with BERT

Zhang*, Tianyi and Kishore*, Varsha and Wu*, Felix and Weinberger, Kilian Q and Artzi, Yoav. BERTScore : Evaluating Text Generation with BERT. International Conference on Learning Representations

[54]

Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes

Kweon, Sunjun and Kim, Junu and Kim, Jiyoun and Im, Sujeong and Cho, Eunbyeol and Bae, Seongsu and Oh, Jungwoo and Lee, Gyubok and Moon, Jong Hak and You, Seng Chan and Baek, Seungjin and Han, Chang Hoon and Jung, Yoon Bin and Jo, Yohan and Choi, Edward. Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes. Findings of the As...

2024

[55]

Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

Maini, Pratyush and Seto, Skyler and Bai, Richard and Grangier, David and Zhang, Yizhe and Jaitly, Navdeep. Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

[56]

Discharge Me!

Xu, Justin and Chen, Zhihong and Johnston, Andrew and Blankemeier, Louis and Varma, Maya and Hom, Jason and Collins, William J and Modi, Ankit and Lloyd, Robert and Hopkins, Benjamin and Langlotz, Curtis and Delbrouck, Jean-Benoit. Overview of the First Shared Task on Clinical Text Generation: RRG24 and “Discharge Me!”. Proceedings of the 23rd Workshop on...

[57]

Hard for humans, hard for machines: predicting readmission after psychiatric hospitalization using narrative notes

Boag, William and Kovaleva, Olga and McCoy, Jr, Thomas H and Rumshisky, Anna and Szolovits, Peter and Perlis, Roy H. Hard for humans, hard for machines: predicting readmission after psychiatric hospitalization using narrative notes. Translational psychiatry

[58]

e-Health CSIRO at RRG24: Entropy-Augmented Self-Critical Sequence Training for Radiology Report Generation

Nicolson, Aaron and Liu, Jinghui and Dowling, Jason and Nguyen, Anthony and Koopman, Bevan. e-Health CSIRO at RRG24: Entropy-Augmented Self-Critical Sequence Training for Radiology Report Generation. Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

[59]

Qwen2.5 Technical Report

Qwen and Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Ke...

[60]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Hsieh, Cheng-Ping and Sun, Simeng and Kriman, Samuel and Acharya, Shantanu and Rekesh, Dima and Jia, Fei and Ginsburg, Boris. RULER: What's the Real Context Size of Your Long-Context Language Models?. First Conference on Language Modeling

[61]

Lost in the middle: How language models use long contexts

Liu, Nelson F and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics

[62]

A unified review of deep learning for automated medical coding

Ji, Shaoxiong and Li, Xiaobo and Sun, Wei and Dong, Hang and Taalas, Ara and Zhang, Yijia and Wu, Honghan and Pitkänen, Esa and Marttinen, Pekka. A unified review of deep learning for automated medical coding. ACM computing surveys

[63]

Current and future state of evaluation of large language models for medical summarization tasks

Croxford, Emma and Gao, Yanjun and Pellegrino, Nicholas and Wong, Karen and Wills, Graham and First, Elliot and Liao, Frank and Goswami, Cherodeep and Patterson, Brian and Afshar, Majid. Current and future state of evaluation of large language models for medical summarization tasks. npj Health Systems

[64]

Summarizing Patients’ Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models

Gao, Yanjun and Dligach, Dmitriy and Miller, Timothy and Xu, Dongfang and Churpek, Matthew M M and Afshar, Majid. Summarizing Patients’ Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models. Proceedings of the 29th International Conference on Computational Linguistics

[65]

Paraphrasing to improve the performance of Electronic Health Records Question Answering

Soni, Sarvesh and Roberts, Kirk. Paraphrasing to improve the performance of Electronic Health Records Question Answering. AMIA Summits on Translational Science Proceedings

[66]

AI models collapse when trained on recursively generated data

Shumailov, Ilia and Shumaylov, Zakhar and Zhao, Yiren and Papernot, Nicolas and Anderson, Ross and Gal, Yarin. AI models collapse when trained on recursively generated data. Nature

[67]

QuickUMLS: a fast, unsupervised approach for medical concept extraction

Soldaini, Luca. QuickUMLS: a fast, unsupervised approach for medical concept extraction. MedIR Workshop, SIGIR

[68]

Large language models for reducing clinicians' documentation burden

Roberts, Kirk. Large language models for reducing clinicians' documentation burden. Nature medicine

[69]

A survey of bias in machine learning through the prism of statistical parity

Besse, Philippe and del Barrio, Eustasio and Gordaliza, Paula and Loubes, Jean-Michel and Risser, Laurent. A survey of bias in machine learning through the prism of statistical parity. The American statistician

[70]

Aligning AI research with the needs of clinical coding workflows: Eight recommendations based on US data analysis and critical review

Gan, Yidong and Rybinski, Maciej and Hachey, Ben and Kummerfeld, Jonathan K. Aligning AI research with the needs of clinical coding workflows: Eight recommendations based on US data analysis and critical review. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

[71]

DocLens: Multi-aspect fine-grained medical text evaluation

Xie, Yiqing and Zhang, Sheng and Cheng, Hao and Liu, Pengfei and Gero, Zelalem and Wong, Cliff and Naumann, Tristan and Poon, Hoifung and Rose, Carolyn. DocLens: Multi-aspect fine-grained medical text evaluation. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

[72]

Less is more: Explainable and efficient ICD code prediction with clinical entities

Douglas, James C and Gan, Yidong and Hachey, Ben and Kummerfeld, Jonathan K. Less is more: Explainable and efficient ICD code prediction with clinical entities. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
