Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale
Pith reviewed 2026-05-20 11:40 UTC · model grok-4.3
The pith
LLM-rephrased clinical notes keep core information for broad tasks but lose details needed for ICD coding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through experiments at the million-note scale on data from the MIMIC databases, the work demonstrates that synthetic notes created by rephrasing with LLMs preserve core clinical information and the ability to make useful predictions on coarse-grained tasks, despite large shifts in language. For finer tasks such as ICD coding, however, important details are lost. Rephrasing the notes in chunks rather than as complete documents substantially reduces this loss of detail, but this comes at the expense of lower factual precision when the rephrasing model works with incomplete surrounding context. Synthesis errors turn out to be driven mainly by misinterpretation of clinical context, together with temporal confusion, measurement errors, and fabricated claims.
What carries the argument
Multi-aspect evaluation at million-note scale that combines intrinsic similarity, extrinsic task utility, and factuality assessment on LLM-rephrased notes from clinical databases.
If this is right
- Synthetic notes can augment training sets for rare ICD codes without being tailored to that task.
- Chunk-based rephrasing recovers more fine-grained clinical details than full-note rephrasing.
- Full context during rephrasing improves factual precision compared to incomplete context.
- Most errors arise from context misinterpretation rather than other fabrication types.
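The contrast between chunk-based and full-note rephrasing can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the helper names (`split_into_chunks`, `rephrase_by_chunks`, `rephrase_whole`) are hypothetical, chunks are measured in whitespace-separated words rather than model tokens, and `rephrase` stands in for whatever LLM call is actually used.

```python
from typing import Callable, List


def split_into_chunks(note: str, max_words: int) -> List[str]:
    """Split a note into consecutive chunks of at most max_words words.

    Assumption: whitespace-separated words approximate tokens here.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = note.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def rephrase_by_chunks(note: str, rephrase: Callable[[str], str], max_words: int) -> str:
    """Rephrase each chunk independently and join the results with newlines.

    Each call to `rephrase` sees only its own chunk, which is the
    incomplete-context setting that may lower factual precision.
    """
    chunks = split_into_chunks(note, max_words)
    return "\n".join(rephrase(chunk) for chunk in chunks)


def rephrase_whole(note: str, rephrase: Callable[[str], str]) -> str:
    """Rephrase the complete note in one call, so the model sees full context."""
    return rephrase(note)
```

In this sketch the only difference between the two strategies is how much of the note each `rephrase` call can see, which is the trade-off the bullets above describe.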
Where Pith is reading between the lines
- These results suggest testing hybrid rephrasing strategies that mix chunk and full-note approaches for different note sections.
- Similar evaluations could apply to other medical text types such as discharge summaries or radiology reports.
- The approach opens possibilities for generating privacy-preserving synthetic datasets for broader medical AI development.
Load-bearing premise
The chosen intrinsic, extrinsic, and factuality metrics together give a complete enough picture of synthetic note quality for all downstream clinical applications.
What would settle it
A direct comparison showing that chunk-rephrased notes do not improve ICD coding performance over whole-note rephrased notes on a large test set would falsify the mitigation benefit of chunking.
Original abstract
Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect -- such as similarity or utility comparisons -- even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for tasks like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes -- despite their task-agnostic nature -- can effectively augment task-specific training for rare ICD codes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a large-scale (million-note) evaluation of LLM-rephrased synthetic clinical notes derived from MIMIC data, combining intrinsic similarity metrics, extrinsic predictive utility tests (including ICD coding), and factuality/error analysis. It claims that synthetic notes retain core clinical information and utility for coarse-grained tasks despite linguistic divergence, but lose fine-grained details; chunk-wise rephrasing substantially mitigates the detail loss for tasks such as ICD coding, albeit at the expense of reduced factual precision under incomplete context. Synthesis errors are dominated by context misinterpretation, accompanied by temporal confusion, measurement errors, and fabricated claims. The paper also reports that the synthetic notes can augment training data for rare ICD codes.
Significance. If the central claims hold, the work supplies a practical, multi-faceted evaluation framework for synthetic clinical text at unprecedented scale and demonstrates a concrete mitigation strategy (chunk rephrasing) that could improve utility for data augmentation in clinical NLP, particularly for rare-event modeling. The million-note scale and complementary intrinsic/extrinsic/factuality lens are genuine strengths that go beyond typical narrow similarity or utility studies.
major comments (2)
- [Abstract / Extrinsic evaluation] Abstract and extrinsic-evaluation section: the headline claim that chunk rephrasing 'substantially mitigates' loss of fine-grained detail rests on ICD-coding accuracy as the sole reported extrinsic proxy for detail preservation. Given that the paper's own error analysis identifies temporal confusion and context misinterpretation as dominant failure modes—precisely the phenomena that incomplete chunk context could exacerbate—the mitigation benefit may be task-specific rather than general. Additional fine-grained tasks (medication reconciliation, procedure sequencing, temporal ordering) are needed to support the broader claim.
- [Methods] Methods / chunk-rephrasing description: chunk size is listed as a free parameter with no sensitivity analysis or pre-specified justification provided. Because the mitigation result for ICD coding is shown only for the chosen chunking regime, it is unclear whether the reported improvement is robust or post-hoc tuned; a load-bearing sensitivity table or ablation across chunk sizes would be required to substantiate the strategy.
minor comments (2)
- [Abstract] The abstract states that the three evaluation families are 'complementary and best viewed in parallel,' yet no quantitative integration or joint statistical test across intrinsic, extrinsic, and factuality scores is reported; readers cannot easily judge overall quality trade-offs.
- [Results] Error bars, confidence intervals, or statistical significance tests are absent from the reported metric comparisons; given the million-note scale, even modest effect sizes should be accompanied by uncertainty estimates.
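One common way to attach uncertainty estimates to a comparison between two rephrasing strategies is a paired percentile bootstrap over per-note scores. The sketch below is an illustration under stated assumptions, not a method taken from the paper: the function name `paired_bootstrap_ci` is hypothetical, and it assumes each note has one score under each strategy (for example, per-note F1 from the same ICD coding model).

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for A minus B.

    Notes are resampled with replacement, keeping each note's pair of
    scores together, and the bounds are percentiles of the resampled means.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper
```

If the resulting interval excludes zero, the observed difference is unlikely to be a resampling artifact; at million-note scale the interval will typically be narrow, which is why reporting it alongside the effect size is informative.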
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We appreciate the recognition of the scale and multi-faceted nature of our evaluation. Below, we provide point-by-point responses to the major comments and indicate the revisions we plan to make.
-
Referee: [Abstract / Extrinsic evaluation] Abstract and extrinsic-evaluation section: the headline claim that chunk rephrasing 'substantially mitigates' loss of fine-grained detail rests on ICD-coding accuracy as the sole reported extrinsic proxy for detail preservation. Given that the paper's own error analysis identifies temporal confusion and context misinterpretation as dominant failure modes—precisely the phenomena that incomplete chunk context could exacerbate—the mitigation benefit may be task-specific rather than general. Additional fine-grained tasks (medication reconciliation, procedure sequencing, temporal ordering) are needed to support the broader claim.
Authors: We acknowledge that our extrinsic evaluation for fine-grained detail preservation relies primarily on ICD coding accuracy as a proxy. ICD coding is a clinically relevant task that requires precise extraction of specific diagnostic information from the notes, making it a strong indicator of detail retention. However, we agree that demonstrating the mitigation effect on additional tasks would strengthen the claim. In the revised manuscript, we will expand the discussion to address why ICD coding serves as a representative task for fine-grained clinical details and include a limitations section noting that the benefits may vary across tasks. If space and resources permit, we will consider adding preliminary results for one additional task such as medication reconciliation.
Revision: partial
-
Referee: [Methods] Methods / chunk-rephrasing description: chunk size is listed as a free parameter with no sensitivity analysis or pre-specified justification provided. Because the mitigation result for ICD coding is shown only for the chosen chunking regime, it is unclear whether the reported improvement is robust or post-hoc tuned; a load-bearing sensitivity table or ablation across chunk sizes would be required to substantiate the strategy.
Authors: We agree that a sensitivity analysis on chunk size would improve the robustness of our findings. The chunk size was chosen based on preliminary experiments to balance context completeness with computational efficiency, but we did not include the full ablation in the original submission. In the revision, we will add a sensitivity analysis table showing ICD coding performance across a range of chunk sizes (e.g., 500, 1000, 2000 tokens) to demonstrate that the improvement is consistent and not tuned to a specific value.
Revision: yes
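A chunk-size sensitivity analysis of the kind discussed here could be organized as a simple sweep. The following is a sketch under stated assumptions rather than the authors' code: `chunk_size_sweep` and `format_table` are hypothetical names, `rephrase_with_chunk_size` stands in for the rephrasing step, and `score` stands in for whatever downstream metric is used (such as ICD coding performance).

```python
from typing import Callable, Dict, Iterable, List


def chunk_size_sweep(
    notes: List[str],
    chunk_sizes: Iterable[int],
    rephrase_with_chunk_size: Callable[[str, int], str],
    score: Callable[[List[str]], float],
) -> Dict[int, float]:
    """Rephrase all notes at each chunk size and record one score per size."""
    results: Dict[int, float] = {}
    for size in chunk_sizes:
        synthetic = [rephrase_with_chunk_size(note, size) for note in notes]
        results[size] = score(synthetic)
    return results


def format_table(results: Dict[int, float]) -> str:
    """Render the sweep as a tab-separated table sorted by chunk size."""
    lines = ["chunk_size\tscore"]
    for size in sorted(results):
        lines.append(f"{size}\t{results[size]:.3f}")
    return "\n".join(lines)
```

Fixing the list of chunk sizes before running the sweep, rather than after inspecting results, is what would address the concern about post-hoc tuning.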
Circularity Check
No circularity in empirical evaluation of synthetic clinical notes
The paper performs direct empirical comparisons of LLM-rephrased notes against original MIMIC data using intrinsic similarity metrics, extrinsic task performance (e.g., ICD coding), and factuality checks at scale. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the reported results. Central claims rest on observable differences in linguistic changes, detail preservation, and error modes rather than any quantity defined in terms of itself or reduced by construction to the input data.
Axiom & Free-Parameter Ledger
free parameters (1)
- chunk size for rephrasing
axioms (1)
- domain assumption: MIMIC clinical notes are representative of real-world clinical documentation for evaluation purposes
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: systematic evaluation ... intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: loss of fine-grained details ... mitigated by rephrasing notes by chunks
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.