pith. sign in

arxiv: 2605.25020 · v1 · pith:YOSIE2WKnew · submitted 2026-05-24 · 💻 cs.AI · cs.CL

Privacy-Preserving Local Language Models for Longitudinal Data Retrieval in Chronic Dermatologic Disease: Implementation in Pemphigus Patients

Pith reviewed 2026-06-30 11:25 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords pemphiguslongitudinal recordssmall language modelsprivacy-preserving AIclinical feature extractiondermatology summarieschronic disease follow-uplocal deployment
0
0 comments X

The pith

A locally deployed small language model extracts 56 clinical features from pemphigus patient records at 82 percent accuracy and generates summaries that dermatologists prefer in over half of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Chronic dermatologic diseases produce lengthy longitudinal records that are difficult to review fully during visits. The paper tests a privacy-preserving small language model running entirely on local hardware to pull out 56 specific features and create one final summary from each complete patient history. On records from thirty pemphigus patients the model reached mean accuracy of 82.25 percent across 1680 retrieval tasks. Dermatologists gave the resulting summaries high marks for quality, accuracy, and usefulness, preferring the model output in 53.3 percent of direct comparisons. The results indicate that such local models can produce clinically usable longitudinal summaries without transmitting patient data outside the clinic.

Core claim

The authors aggregated 541 visit notes from thirty pemphigus patients into full longitudinal records and had two expert dermatologists annotate 56 clinically relevant features. They then prompted the locally deployed Qwen3 4B Thinking 2507 model with each complete record to retrieve the features and generate final report summaries. The model achieved 82.25 percent mean accuracy on the 1680 feature tasks. Dermatologist ratings of the generated summaries averaged 8.23-8.50 across quality, clinical accuracy, and usefulness scales with no significant difference between evaluators, and the AI summaries were preferred in 53.3 percent of evaluations. The authors conclude that privacy-preserving loc

What carries the argument

The locally deployed Qwen3 4B Thinking 2507 small language model prompted with each patient's complete longitudinal record to retrieve 56 annotated features and produce one final summary.

If this is right

  • Clinicians could review extensive longitudinal histories more quickly during routine visits.
  • The approach reduces the chance of overlooking critical past information in chronic disease management.
  • Local deployment keeps all patient data inside the clinic and avoids external transmission.
  • The same workflow could be applied to other chronic dermatologic conditions that generate long follow-up records.
  • Integration into clinical systems could support decision-making when used with human oversight.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local-model approach might be tested on records from other chronic conditions such as psoriasis or atopic dermatitis to check generalizability.
  • Prospective trials could measure whether access to these summaries changes actual treatment decisions or patient outcomes.
  • Fine-tuning the model on dermatology-specific notes could raise feature accuracy above the current 82 percent level.
  • Pairing the model with electronic health record interfaces could allow on-the-fly summary generation during visits.

Load-bearing premise

The two dermatologists' annotations of the 56 features count as reliable ground truth, and their quality ratings and preference counts show the model outperforms experts.

What would settle it

A larger study in which additional independent dermatologists re-annotate the same records and rate new model summaries, resulting in accuracy below 70 percent or preference for human notes in most cases, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.25020 by Abdurrahim Yilmaz, Ay\c{s}e Esra Koku Aksu, Burak Temelkuran, Duygu Yamen, Gulsum Gencoglan, Joram M. Posma, Mehmet Salih Gurel, Vefa Asli Erdemir.

Figure 1
Figure 1. Figure 1: Overview of the study workflow: longitudinal pemphigus follow-up notes are aggregated into patient-level records and processed locally by a privacy-preserving SLM to retrieve structured clinical features and generate longitudinal summaries, which are then evaluated based on expert annotations [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative comparison between clinician-written ground truth reports and AI-generated reports: Each example consists of two parts, with the clinician-written ground truth report shown above and the AI-generated response shown below. The figure contains text-only excerpts annotated for qualitative analysis. Correct or clinically useful information is highlighted in green, explanatory content that does not … view at source ↗
read the original abstract

Chronic dermatologic diseases such as pemphigus require long-term follow-up, generating extensive longitudinal clinical documentation that is difficult to review comprehensively during routine visits and increasing clinician workload as well as the risk of missing critical historical information. We evaluated whether a locally deployed, privacy-preserving small language model (SLM) could retrieve structured clinical features and generate longitudinal summaries from long-term dermatology follow-up records. In this retrospective case series, thirty pemphigus patients contributed 541 visit notes that were aggregated into full longitudinal records (89,336 words); 56 clinically relevant features were annotated by two expert dermatologists. The locally deployed SLM (Qwen3 4B Thinking 2507) was queried with each complete record to retrieve 56 features and generate one final report summaries. Across 1,680 feature retrieval tasks, mean accuracy was 82.25%. Dermatologists' ratings of AI-generated summaries were high for overall quality (8.23-8.47), clinical accuracy (7.93-8.20), and usefulness (8.47-8.50), with no significant inter-evaluator differences and an overall preference for AI summaries in 53.3% of evaluations. These findings suggest that privacy-preserving, locally deployed SLMs can outperform medical experts and reliably generate clinically meaningful longitudinal summaries. SLMs may support clinical decision-making when integrated with appropriate oversight.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates a locally deployed small language model (Qwen3 4B Thinking) on 541 visit notes from 30 pemphigus patients (aggregated to 89,336 words) for retrieving 56 expert-annotated clinical features and generating longitudinal summaries. It reports 82.25% mean accuracy across 1,680 retrieval tasks, high dermatologist Likert ratings (overall quality 8.23-8.47, clinical accuracy 7.93-8.20, usefulness 8.47-8.50), and 53.3% overall preference for the AI summaries, concluding that privacy-preserving local SLMs can outperform medical experts and support clinical decision-making.

Significance. If the evaluation methodology is strengthened to directly test superiority, the work would provide useful evidence on the viability of on-device SLMs for longitudinal record summarization in chronic disease management, with the local deployment approach addressing privacy constraints that limit cloud-based medical AI. The retrospective design with real expert annotations on a focused disease cohort is a positive aspect of the empirical setup.

major comments (2)
  1. [Abstract] Abstract: the central claim that the findings 'suggest that privacy-preserving, locally deployed SLMs can outperform medical experts' is not supported by the reported results. Feature retrieval accuracy (82.25%) quantifies agreement with the two dermatologists' annotations of the 56 features rather than any direct comparison of model performance against expert-generated outputs. The 53.3% preference rate for AI summaries is presented without a described comparator arm (e.g., no blinded evaluation against summaries written by the same experts) or statistical testing, so the data support approximation and favorable ratings but not outperformance.
  2. [Results (summary evaluation)] Summary evaluation section: the preference testing reports an overall 53.3% rate favoring AI summaries with high Likert scores but provides no details on how the comparator summaries were generated, who produced them, or whether the preference task was blinded and balanced. Without this, the preference metric cannot substantiate superiority over expert performance.
minor comments (2)
  1. [Abstract] Abstract: 'one final report summaries' appears to be a grammatical error and should be clarified.
  2. [Methods] The manuscript does not report inter-rater agreement statistics between the two expert dermatologists on the 56-feature annotations, which would strengthen the reliability of the ground-truth labels used for the 82.25% accuracy calculation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We appreciate the careful reading and the identification of areas where our claims and reporting require strengthening. We will make revisions to moderate the language regarding outperformance and to provide additional methodological details on the summary evaluation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the findings 'suggest that privacy-preserving, locally deployed SLMs can outperform medical experts' is not supported by the reported results. Feature retrieval accuracy (82.25%) quantifies agreement with the two dermatologists' annotations of the 56 features rather than any direct comparison of model performance against expert-generated outputs. The 53.3% preference rate for AI summaries is presented without a described comparator arm (e.g., no blinded evaluation against summaries written by the same experts) or statistical testing, so the data support approximation and favorable ratings but not outperformance.

    Authors: We concur that the central claim in the abstract regarding outperformance is not substantiated by the results as presented. The accuracy metric reflects agreement with expert annotations, and the preference rate does not include a direct comparison to summaries generated by the annotating experts or statistical analysis. We will revise the abstract to remove this claim and instead emphasize the high agreement and positive ratings received by the AI-generated summaries. revision: yes

  2. Referee: [Results (summary evaluation)] Summary evaluation section: the preference testing reports an overall 53.3% rate favoring AI summaries with high Likert scores but provides no details on how the comparator summaries were generated, who produced them, or whether the preference task was blinded and balanced. Without this, the preference metric cannot substantiate superiority over expert performance.

    Authors: We acknowledge that the summary evaluation section lacks sufficient detail on the comparator summaries. In the revised manuscript, we will expand this section to fully describe the generation of comparator summaries, the blinding and balancing of the preference task, and report statistical testing for the preference rates to allow proper interpretation of the results. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation with no derivations or self-referential constructions

full rationale

The manuscript is a retrospective case series reporting empirical metrics (feature retrieval accuracy of 82.25% across 1,680 tasks, Likert-scale quality ratings, and 53.3% preference rate) from direct comparison of SLM outputs against two dermatologists' annotations of 56 features in 541 notes. No equations, parameter fitting, predictive derivations, or self-citations appear in the provided text or abstract. The central claim rests on these external benchmarks rather than any reduction to the study's own inputs by construction. This matches the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new postulated entities; the work is an empirical application of an off-the-shelf model to annotated medical text.

pith-pipeline@v0.9.1-grok · 5834 in / 1198 out tokens · 47616 ms · 2026-06-30T11:25:34.676972+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 2 internal anchors

  1. [1]

    & Joly, P

    Schmidt, E., Kasperkiewicz, M. & Joly, P. Pemphigus.The Lancet394, 882–894, DOI: 10.1016/S0140-6736(19)31778-7 (2019)

  2. [2]

    Medicine30, 1134–1142, DOI: 10.1038/s41591-024-02855-5 (2024)

    Van Veen, D.et al.Adapted large language models can outperform medical experts in clinical text summarization.Nat. Medicine30, 1134–1142, DOI: 10.1038/s41591-024-02855-5 (2024)

  3. [3]

    Medicine5, 186, DOI: 10.1038/s41746-022-00730-6 (2022)

    Wu, H.et al.A survey on clinical natural language processing in the United Kingdom from 2007 to 2022.npj Digit. Medicine5, 186, DOI: 10.1038/s41746-022-00730-6 (2022)

  4. [4]

    A Primer on Neural Network Models for Natural Language Processing.J

    Goldberg, Y . A Primer on Neural Network Models for Natural Language Processing.J. Artif. Intell. Res.57, 345–420, DOI: 10.1613/jair.4992 (2016)

  5. [5]

    Paganelli, A.et al.Natural language processing in dermatology: A systematic literature review and state of the art.J. Eur. Acad. Dermatol. V enereol.38, 2225–2234, DOI: 10.1111/jdv.20286 (2024). _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/jdv.20286. 7.Azarfar, G.et al.Responsible adoption of multimodal artificial intelligence in health care: ...

  6. [6]

    Yilmaz, F

    Yilmaz, A.et al.Resource-efficient medical vision language model for dermatology via a synthetic data generation framework, DOI: 10.1101/2025.05.17.25327785 (2025). ISSN: 3067-2007 Pages: 2025.05.17.25327785

  7. [7]

    Dis.9, ofac471, DOI: 10.1093/ofid/ofac471 (2022)

    Goodman-Meza, D.et al.Natural Language Processing and Machine Learning to Identify People Who Inject Drugs in Electronic Health Records.Open F orum Infect. Dis.9, ofac471, DOI: 10.1093/ofid/ofac471 (2022)

  8. [8]

    Bootsma-Robroeks, C. M. H. H. T.et al.AI-generated draft replies to patient messages: exploring effects of implementation. Front. Digit. Heal.7, DOI: 10.3389/fdgth.2025.1588143 (2025)

  9. [9]

    S., Yuan, W., Poddar, M., Elsamadisi, P

    Marwaha, J. S., Yuan, W., Poddar, M., Elsamadisi, P. & Brat, G. A. The algorithmic consultant: a new era of clinical AI calls for a new workforce of physician-algorithm specialists.npj Digit. Medicine8, 552, DOI: 10.1038/s41746-025-01960-0 (2025)

  10. [10]

    Wang, J.et al.Systematic Evaluation of Research Progress on Natural Language Processing in Medicine Over the Past 20 Years: Bibliometric Study on PubMed.J. Med. Internet Res.22, e16816, DOI: 10.2196/16816 (2020). Company: Journal of Medical Internet Research Distributor: Journal of Medical Internet Research Institution: Journal of Medical Internet Researc...

  11. [11]

    J., Reyes Nieva, H., Lee, S

    Bear Don’t Walk, O. J., Reyes Nieva, H., Lee, S. S.-J. & Elhadad, N. A scoping review of ethics considerations in clinical natural language processing.JAMIA Open5, ooac039, DOI: 10.1093/jamiaopen/ooac039 (2022)

  12. [12]

    Qwen3 Technical Report

    Obika, D.et al.Safety principles for medical summarization using generative AI.Nat. Medicine30, 3417–3419, DOI: 10.1038/s41591-024-03313-y (2024). 15.Yang, A.et al.Qwen3 Technical Report, DOI: 10.48550/arXiv.2505.09388 (2025). ArXiv:2505.09388 [cs]

  13. [13]

    Dymek, C.et al.Building the evidence-base to reduce electronic health record–related clinician burden.J. Am. Med. Informatics Assoc.28, 1057–1061, DOI: 10.1093/jamia/ocaa238 (2021)

  14. [14]

    K.et al.How can artificial intelligence decrease cognitive and work burden for front line practitioners?JAMIA Open6, ooad079, DOI: 10.1093/jamiaopen/ooad079 (2023)

    Gandhi, T. K.et al.How can artificial intelligence decrease cognitive and work burden for front line practitioners?JAMIA Open6, ooad079, DOI: 10.1093/jamiaopen/ooad079 (2023)

  15. [15]

    MedGemma Technical Report

    Center for AI Safetyet al.A benchmark of expert-level academic questions to assess AI capabilities.Nature649, 1139–1146, DOI: 10.1038/s41586-025-09962-4 (2026). 19.Sellergren, A.et al.MedGemma Technical Report, DOI: 10.48550/arXiv.2507.05201 (2025). ArXiv:2507.05201 [cs]

  16. [16]

    Shah, S. V . Accuracy, Consistency, and Hallucination of Large Language Models When Analyzing Unstructured Clinical Notes in Electronic Medical Records.JAMA Netw. Open7, e2425953, DOI: 10.1001/jamanetworkopen.2024.25953 (2024)

  17. [17]

    Du, X.et al.Testing and Evaluation of Generative Large Language Models in Electronic Health Record Applications: A Systematic Review, DOI: 10.1101/2024.08.11.24311828 (2024)

  18. [18]

    High” if at least 2 of the following are concordant: histology, direct immunofluorescence, anti-DSG1/3 values; 2) “Low

    Kim, Y .et al.Medical Hallucinations in Foundation Models and Their Impact on Healthcare, DOI: 10.48550/arXiv.2503. 05777 (2025). ArXiv:2503.05777 [cs]. 9/12 Appendix The main prompt and clinical features with their input type and notes. Each clinical feature were prompted separately. Main Prompt:You are an expert dermatologist tasked with summarizing lon...