Privacy-Preserving Local Language Models for Longitudinal Data Retrieval in Chronic Dermatologic Disease: Implementation in Pemphigus Patients

Abdurrahim Yilmaz; Ayşe Esra Koku Aksu; Burak Temelkuran; Duygu Yamen; Gulsum Gencoglan; Joram M. Posma; Mehmet Salih Gurel; Vefa Asli Erdemir

arxiv: 2605.25020 · v1 · pith:YOSIE2WKnew · submitted 2026-05-24 · 💻 cs.AI · cs.CL

Abdurrahim Yilmaz, Ayşe Esra Koku Aksu, Duygu Yamen, Vefa Asli Erdemir, Mehmet Salih Gurel, Gulsum Gencoglan, Joram M. Posma, Burak Temelkuran

Pith reviewed 2026-06-30 11:25 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords: pemphigus · longitudinal records · small language models · privacy-preserving AI · clinical feature extraction · dermatology summaries · chronic disease follow-up · local deployment

The pith

A locally deployed small language model extracts 56 clinical features from pemphigus patient records at 82 percent accuracy and generates summaries that dermatologists prefer in over half of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Chronic dermatologic diseases produce lengthy longitudinal records that are difficult to review fully during visits. The paper tests a privacy-preserving small language model running entirely on local hardware to pull out 56 specific features and create one final summary from each complete patient history. On records from thirty pemphigus patients the model reached mean accuracy of 82.25 percent across 1680 retrieval tasks. Dermatologists gave the resulting summaries high marks for quality, accuracy, and usefulness, preferring the model output in 53.3 percent of direct comparisons. The results indicate that such local models can produce clinically usable longitudinal summaries without transmitting patient data outside the clinic.

Core claim

The authors aggregated 541 visit notes from thirty pemphigus patients into full longitudinal records and had two expert dermatologists annotate 56 clinically relevant features. They then prompted the locally deployed Qwen3 4B Thinking 2507 model with each complete record to retrieve the features and generate final report summaries. The model achieved 82.25 percent mean accuracy on the 1680 feature tasks. Dermatologist ratings of the generated summaries averaged 8.23-8.50 across quality, clinical accuracy, and usefulness scales with no significant difference between evaluators, and the AI summaries were preferred in 53.3 percent of evaluations. The authors conclude that privacy-preserving loc

What carries the argument

The locally deployed Qwen3 4B Thinking 2507 small language model prompted with each patient's complete longitudinal record to retrieve 56 annotated features and produce one final summary.
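The retrieval setup can be pictured as a loop over patients and features: 30 records times 56 features gives the 1,680 tasks reported. The following is a minimal sketch, not the authors' code. The `ask_model` callable, the feature names, the toy records, and the exact-match scoring rule are all assumptions made for illustration, and the stub stands in for a call to a locally served model.

```python
from typing import Callable, Dict, List, Tuple


def evaluate_retrieval(
    records: Dict[str, str],
    annotations: Dict[str, Dict[str, str]],
    features: List[str],
    ask_model: Callable[[str, str], str],
) -> Tuple[float, int]:
    """Query the model once per (patient, feature) pair and score by exact match."""
    correct = 0
    total = 0
    for patient_id, record in records.items():
        for feature in features:
            predicted = ask_model(record, feature)
            expected = annotations[patient_id][feature]
            # Assumed scoring rule: case-insensitive exact match.
            correct += int(predicted.strip().lower() == expected.strip().lower())
            total += 1
    return 100.0 * correct / total, total


def stub_model(record: str, feature: str) -> str:
    """Hypothetical stand-in for a local SLM call; it only does a keyword lookup."""
    if feature == "rituximab_used":
        return "yes" if "rituximab" in record.lower() else "no"
    return "unknown"


# Toy data (invented for this sketch, not taken from the study).
records = {
    "p01": "Visit 1: started prednisolone. Visit 5: rituximab infusion.",
    "p02": "Visit 1: topical steroids only.",
}
annotations = {
    "p01": {"rituximab_used": "yes", "mucosal_involvement": "yes"},
    "p02": {"rituximab_used": "no", "mucosal_involvement": "unknown"},
}
features = ["rituximab_used", "mucosal_involvement"]

accuracy, n_tasks = evaluate_retrieval(records, annotations, features, stub_model)
print(f"{n_tasks} tasks, accuracy {accuracy:.2f}%")
print("Study-sized grid:", 30 * 56, "tasks")
```

Because the score is computed against the dermatologists' labels, a number produced this way measures agreement with the annotators rather than performance beyond them.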

If this is right

Clinicians could review extensive longitudinal histories more quickly during routine visits.
The approach reduces the chance of overlooking critical past information in chronic disease management.
Local deployment keeps all patient data inside the clinic and avoids external transmission.
The same workflow could be applied to other chronic dermatologic conditions that generate long follow-up records.
Integration into clinical systems could support decision-making when used with human oversight.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same local-model approach might be tested on records from other chronic conditions such as psoriasis or atopic dermatitis to check generalizability.
Prospective trials could measure whether access to these summaries changes actual treatment decisions or patient outcomes.
Fine-tuning the model on dermatology-specific notes could raise feature accuracy above the current 82 percent level.
Pairing the model with electronic health record interfaces could allow on-the-fly summary generation during visits.

Load-bearing premise

The two dermatologists' annotations of the 56 features count as reliable ground truth, and their quality ratings and preference counts show the model outperforms experts.

What would settle it

A larger study in which additional independent dermatologists re-annotate the same records and rate new model summaries, resulting in accuracy below 70 percent or preference for human notes in most cases, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.25020 by Abdurrahim Yilmaz, Ayşe Esra Koku Aksu, Burak Temelkuran, Duygu Yamen, Gulsum Gencoglan, Joram M. Posma, Mehmet Salih Gurel, Vefa Asli Erdemir.

**Figure 1.** Overview of the study workflow: longitudinal pemphigus follow-up notes are aggregated into patient-level records and processed locally by a privacy-preserving SLM to retrieve structured clinical features and generate longitudinal summaries, which are then evaluated based on expert annotations.

**Figure 2.** Qualitative comparison between clinician-written ground truth reports and AI-generated reports: Each example consists of two parts, with the clinician-written ground truth report shown above and the AI-generated response shown below. The figure contains text-only excerpts annotated for qualitative analysis. Correct or clinically useful information is highlighted in green, explanatory content that does not …

Original abstract

Chronic dermatologic diseases such as pemphigus require long-term follow-up, generating extensive longitudinal clinical documentation that is difficult to review comprehensively during routine visits and increasing clinician workload as well as the risk of missing critical historical information. We evaluated whether a locally deployed, privacy-preserving small language model (SLM) could retrieve structured clinical features and generate longitudinal summaries from long-term dermatology follow-up records. In this retrospective case series, thirty pemphigus patients contributed 541 visit notes that were aggregated into full longitudinal records (89,336 words); 56 clinically relevant features were annotated by two expert dermatologists. The locally deployed SLM (Qwen3 4B Thinking 2507) was queried with each complete record to retrieve 56 features and generate one final report summaries. Across 1,680 feature retrieval tasks, mean accuracy was 82.25%. Dermatologists' ratings of AI-generated summaries were high for overall quality (8.23-8.47), clinical accuracy (7.93-8.20), and usefulness (8.47-8.50), with no significant inter-evaluator differences and an overall preference for AI summaries in 53.3% of evaluations. These findings suggest that privacy-preserving, locally deployed SLMs can outperform medical experts and reliably generate clinically meaningful longitudinal summaries. SLMs may support clinical decision-making when integrated with appropriate oversight.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gets 82% feature extraction accuracy with a local model on pemphigus notes but does not show the model outperforms experts at making summaries.

The main takeaway is that a small local model can pull structured features from long dermatology records with reasonable accuracy and produce summaries that raters score well. They report 82.25% mean accuracy on 1,680 feature tasks against two dermatologists' labels, plus Likert scores around 8 for quality and usefulness, with 53.3% preference for the AI versions.

What the work does is apply an existing SLM to a narrow but practical longitudinal retrieval task in one chronic skin disease, keeping everything on-device for privacy. The numbers are concrete and the setup is a retrospective case series on real notes, which is a legitimate domain check.

The soft spot is the claim that the model outperforms medical experts. Feature accuracy measures agreement with the experts' annotations, not better performance. The preference rate for AI summaries lacks a clear comparator arm where the same experts produce their own summaries for blinded rating, and no statistical test is mentioned. Without that, the data support approximation and acceptable quality, not superiority. Inter-rater reliability between the two annotators is also not reported.

This is for applied researchers or clinicians testing privacy-preserving tools in dermatology or similar longitudinal record work. A reader who wants a real-world example of local model use on messy notes can get value from the metrics, but anyone expecting a strong head-to-head result will be disappointed.

It deserves peer review. The empirical core is solid enough to warrant referee input even if the interpretation needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper evaluates a locally deployed small language model (Qwen3 4B Thinking) on 541 visit notes from 30 pemphigus patients (aggregated to 89,336 words) for retrieving 56 expert-annotated clinical features and generating longitudinal summaries. It reports 82.25% mean accuracy across 1,680 retrieval tasks, high dermatologist Likert ratings (overall quality 8.23-8.47, clinical accuracy 7.93-8.20, usefulness 8.47-8.50), and 53.3% overall preference for the AI summaries, concluding that privacy-preserving local SLMs can outperform medical experts and support clinical decision-making.

Significance. If the evaluation methodology is strengthened to directly test superiority, the work would provide useful evidence on the viability of on-device SLMs for longitudinal record summarization in chronic disease management, with the local deployment approach addressing privacy constraints that limit cloud-based medical AI. The retrospective design with real expert annotations on a focused disease cohort is a positive aspect of the empirical setup.

major comments (2)

[Abstract] Abstract: the central claim that the findings 'suggest that privacy-preserving, locally deployed SLMs can outperform medical experts' is not supported by the reported results. Feature retrieval accuracy (82.25%) quantifies agreement with the two dermatologists' annotations of the 56 features rather than any direct comparison of model performance against expert-generated outputs. The 53.3% preference rate for AI summaries is presented without a described comparator arm (e.g., no blinded evaluation against summaries written by the same experts) or statistical testing, so the data support approximation and favorable ratings but not outperformance.
[Results (summary evaluation)] Summary evaluation section: the preference testing reports an overall 53.3% rate favoring AI summaries with high Likert scores but provides no details on how the comparator summaries were generated, who produced them, or whether the preference task was blinded and balanced. Without this, the preference metric cannot substantiate superiority over expert performance.

minor comments (2)

[Abstract] Abstract: 'one final report summaries' appears to be a grammatical error and should be clarified.
[Methods] The manuscript does not report inter-rater agreement statistics between the two expert dermatologists on the 56-feature annotations, which would strengthen the reliability of the ground-truth labels used for the 82.25% accuracy calculation.
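One common way to report the inter-rater agreement requested above is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal illustration under stated assumptions: the labels are invented, a single categorical feature is shown, and nothing here reflects the study's actual annotation data or the authors' chosen statistic.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_chance == 1.0:
        return 1.0  # Both raters used one identical label throughout.
    return (p_observed - p_chance) / (1 - p_chance)


# Invented labels for one feature across eight hypothetical patients.
dermatologist_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes"]
dermatologist_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes"]

kappa = cohens_kappa(dermatologist_1, dermatologist_2)
print(f"Cohen's kappa = {kappa:.2f}")
```

In this toy case the raters agree on 7 of 8 items (0.875) with chance agreement of 0.5, which gives a kappa of 0.75. Reporting such a value per feature would show which of the 56 labels are stable enough to serve as ground truth.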

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We appreciate the careful reading and the identification of areas where our claims and reporting require strengthening. We will make revisions to moderate the language regarding outperformance and to provide additional methodological details on the summary evaluation.

Referee: [Abstract] Abstract: the central claim that the findings 'suggest that privacy-preserving, locally deployed SLMs can outperform medical experts' is not supported by the reported results. Feature retrieval accuracy (82.25%) quantifies agreement with the two dermatologists' annotations of the 56 features rather than any direct comparison of model performance against expert-generated outputs. The 53.3% preference rate for AI summaries is presented without a described comparator arm (e.g., no blinded evaluation against summaries written by the same experts) or statistical testing, so the data support approximation and favorable ratings but not outperformance.

Authors: We concur that the central claim in the abstract regarding outperformance is not substantiated by the results as presented. The accuracy metric reflects agreement with expert annotations, and the preference rate does not include a direct comparison to summaries generated by the annotating experts or statistical analysis. We will revise the abstract to remove this claim and instead emphasize the high agreement and positive ratings received by the AI-generated summaries. revision: yes
Referee: [Results (summary evaluation)] Summary evaluation section: the preference testing reports an overall 53.3% rate favoring AI summaries with high Likert scores but provides no details on how the comparator summaries were generated, who produced them, or whether the preference task was blinded and balanced. Without this, the preference metric cannot substantiate superiority over expert performance.

Authors: We acknowledge that the summary evaluation section lacks sufficient detail on the comparator summaries. In the revised manuscript, we will expand this section to fully describe the generation of comparator summaries, the blinding and balancing of the preference task, and report statistical testing for the preference rates to allow proper interpretation of the results. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation with no derivations or self-referential constructions

The manuscript is a retrospective case series reporting empirical metrics (feature retrieval accuracy of 82.25% across 1,680 tasks, Likert-scale quality ratings, and 53.3% preference rate) from direct comparison of SLM outputs against two dermatologists' annotations of 56 features in 541 notes. No equations, parameter fitting, predictive derivations, or self-citations appear in the provided text or abstract. The central claim rests on these external benchmarks rather than any reduction to the study's own inputs by construction. This matches the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new postulated entities; the work is an empirical application of an off-the-shelf model to annotated medical text.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 2 internal anchors

[1] Schmidt, E., Kasperkiewicz, M. & Joly, P. Pemphigus. The Lancet 394, 882–894, DOI: 10.1016/S0140-6736(19)31778-7 (2019)
[2] Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Medicine 30, 1134–1142, DOI: 10.1038/s41591-024-02855-5 (2024)
[3] Wu, H. et al. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. npj Digit. Medicine 5, 186, DOI: 10.1038/s41746-022-00730-6 (2022)
[4] Goldberg, Y. A Primer on Neural Network Models for Natural Language Processing. J. Artif. Intell. Res. 57, 345–420, DOI: 10.1613/jair.4992 (2016)
[5] Paganelli, A. et al. Natural language processing in dermatology: A systematic literature review and state of the art. J. Eur. Acad. Dermatol. Venereol. 38, 2225–2234, DOI: 10.1111/jdv.20286 (2024). 7. Azarfar, G. et al. Responsible adoption of multimodal artificial intelligence in health care: ...
[6] Yilmaz, A. et al. Resource-efficient medical vision language model for dermatology via a synthetic data generation framework, DOI: 10.1101/2025.05.17.25327785 (2025)
[7] Goodman-Meza, D. et al. Natural Language Processing and Machine Learning to Identify People Who Inject Drugs in Electronic Health Records. Open Forum Infect. Dis. 9, ofac471, DOI: 10.1093/ofid/ofac471 (2022)
[8] Bootsma-Robroeks, C. M. H. H. T. et al. AI-generated draft replies to patient messages: exploring effects of implementation. Front. Digit. Heal. 7, DOI: 10.3389/fdgth.2025.1588143 (2025)
[9] Marwaha, J. S., Yuan, W., Poddar, M., Elsamadisi, P. & Brat, G. A. The algorithmic consultant: a new era of clinical AI calls for a new workforce of physician-algorithm specialists. npj Digit. Medicine 8, 552, DOI: 10.1038/s41746-025-01960-0 (2025)
[10] Wang, J. et al. Systematic Evaluation of Research Progress on Natural Language Processing in Medicine Over the Past 20 Years: Bibliometric Study on PubMed. J. Med. Internet Res. 22, e16816, DOI: 10.2196/16816 (2020)
[11] Bear Don’t Walk, O. J., Reyes Nieva, H., Lee, S. S.-J. & Elhadad, N. A scoping review of ethics considerations in clinical natural language processing. JAMIA Open 5, ooac039, DOI: 10.1093/jamiaopen/ooac039 (2022)
[12] Obika, D. et al. Safety principles for medical summarization using generative AI. Nat. Medicine 30, 3417–3419, DOI: 10.1038/s41591-024-03313-y (2024). 15. Yang, A. et al. Qwen3 Technical Report, DOI: 10.48550/arXiv.2505.09388 (2025). arXiv:2505.09388 [cs]
[13] Dymek, C. et al. Building the evidence-base to reduce electronic health record–related clinician burden. J. Am. Med. Informatics Assoc. 28, 1057–1061, DOI: 10.1093/jamia/ocaa238 (2021)
[14] Gandhi, T. K. et al. How can artificial intelligence decrease cognitive and work burden for front line practitioners? JAMIA Open 6, ooad079, DOI: 10.1093/jamiaopen/ooad079 (2023)
[15] Center for AI Safety et al. A benchmark of expert-level academic questions to assess AI capabilities. Nature 649, 1139–1146, DOI: 10.1038/s41586-025-09962-4 (2026). 19. Sellergren, A. et al. MedGemma Technical Report, DOI: 10.48550/arXiv.2507.05201 (2025). arXiv:2507.05201 [cs]
[16] Shah, S. V. Accuracy, Consistency, and Hallucination of Large Language Models When Analyzing Unstructured Clinical Notes in Electronic Medical Records. JAMA Netw. Open 7, e2425953, DOI: 10.1001/jamanetworkopen.2024.25953 (2024)
[17] Du, X. et al. Testing and Evaluation of Generative Large Language Models in Electronic Health Record Applications: A Systematic Review, DOI: 10.1101/2024.08.11.24311828 (2024)
[18] Kim, Y. et al. Medical Hallucinations in Foundation Models and Their Impact on Healthcare, DOI: 10.48550/arXiv.2503.05777 (2025). arXiv:2503.05777 [cs]

Appendix: The main prompt and clinical features with their input type and notes. Each clinical feature was prompted separately. Main Prompt: You are an expert dermatologist tasked with summarizing lon...
Appendix fragment: “High” if at least 2 of the following are concordant: histology, direct immunofluorescence, anti-DSG1/3 values; 2) “Low ...
