Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology

Ailish Hanly; Angela Sadlowski; Hyunjae Kim; Jeffrey Gehlhausen; Margaret MacGibeny; Morten Lee; Qingyu Chen; Roy Jiang; Shanin Chowdhury; Xuguang Ai

arXiv: 2605.04098 · v1 · submitted 2026-05-01 · 💻 cs.CV · cs.AI · cs.CY


Roy Jiang, Hyunjae Kim, Zhenyue Qin, Morten Lee, Margaret MacGibeny, Ailish Hanly, Angela Sadlowski, Shanin Chowdhury, Xuguang Ai, Jeffrey Gehlhausen, Qingyu Chen


Pith reviewed 2026-05-09 19:56 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CY

keywords multimodal LLMs · dermatology · clinical evaluation · benchmark gap · differential diagnosis · severity triage · real-world performance


The pith

Benchmark performance substantially overestimates the real-world clinical capability of current dermatology multimodal LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests four open-weight multimodal LLMs and one commercial model on two tasks: generating differential diagnoses and performing severity-based triage. On public dermatology datasets the best models reach moderate accuracy, but when the same models are applied to a large retrospective set of real hospital consultation cases the accuracy falls sharply, especially when using images without additional context. Adding clinical notes improves results somewhat yet leaves performance too low and too sensitive to missing or wrong details for safe clinical use. These results indicate that existing benchmark scores do not predict how the models will behave in ordinary dermatology practice.

Core claim

On public benchmarks the top open-weight model reached 26.55 percent top-3 diagnostic accuracy and GPT-4.1 reached 42.25 percent; on the real-world cohort of 5,811 cases these figures fell to between 1.50 and 13.35 percent for open-weight models and 24.65 percent for GPT-4.1 when using images alone. Incorporating clinical context raised the figures to at most 28.75 percent and 38.93 percent respectively, but model outputs remained highly sensitive to incomplete or erroneous context. Severity triage achieved moderate sensitivity above 60 percent.
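To make the size of the drop concrete, the quoted figures can be turned into absolute and relative gaps. The sketch below is illustrative arithmetic only and is not part of the paper. The dictionary keys and the `gap` helper are names invented here, and pairing the best open-weight benchmark score (26.55) with the top of the real-world range (13.35) is an assumption, since the source does not say both come from the same model.

```python
# Top-3 accuracy figures (percent) as quoted in the core claim.
# Assumption: the open-weight pairing uses the upper end of the real-world range.
REPORTED = {
    "best open-weight": {"benchmark": 26.55, "real_world_images_only": 13.35},
    "GPT-4.1": {"benchmark": 42.25, "real_world_images_only": 24.65},
}


def gap(benchmark: float, real_world: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = benchmark - real_world
    relative = 100.0 * absolute / benchmark
    return absolute, relative


for name, scores in REPORTED.items():
    absolute, relative = gap(scores["benchmark"], scores["real_world_images_only"])
    print(f"{name}: -{absolute:.2f} points ({relative:.1f}% relative drop)")
```

Under that pairing, the image-only drop is 13.20 points (about 50 percent relative) for open-weight models and 17.60 points (about 42 percent relative) for GPT-4.1.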

What carries the argument

Retrospective multi-site hospital dermatology consultation cohort of 5,811 cases and 46,405 images, evaluated on differential diagnosis generation and severity-based triage against public benchmark results.

If this is right

Diagnostic accuracy declines substantially from public benchmarks to real-world hospital cases across all tested models.
Incorporating clinical context improves performance but does not reach levels suitable for reliable clinical deployment.
Model outputs are highly sensitive to incomplete or erroneous consultation context.
Severity-based triage shows moderate sensitivity, suggesting possible utility for screening but not for standalone clinical decisions.
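The sensitivity to incomplete context noted in the list above could be probed with a controlled ablation that removes part of each clinical note before it reaches the model. The following is a minimal sketch of such a perturbation, not the procedure used in the paper. `split_sentences`, `ablate_context`, and the example note are all hypothetical, and the sentence splitter is deliberately naive.

```python
import random
import re


def split_sentences(note: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", note.strip())
    return [p for p in parts if p]


def ablate_context(note: str, keep_fraction: float, seed: int = 0) -> str:
    """Keep a random subset of sentences, preserving their original order."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    sentences = split_sentences(note)
    n_keep = round(keep_fraction * len(sentences))
    rng = random.Random(seed)  # seeded so each ablation is reproducible
    kept = sorted(rng.sample(range(len(sentences)), n_keep))
    return " ".join(sentences[i] for i in kept)


# Invented example note, not taken from the study cohort.
note = (
    "Patient on day 5 of a new antibiotic. "
    "Diffuse erythematous rash on trunk. "
    "No mucosal involvement. "
    "Afebrile."
)
for fraction in (1.0, 0.5, 0.0):
    print(fraction, "->", repr(ablate_context(note, fraction)))
```

Scoring the same cases at several values of `keep_fraction` would give a curve of accuracy against context completeness, which is one way to quantify the fragility described here.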

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Development of future dermatology MLLMs should emphasize robustness to partial or noisy clinical information.
Prospective clinical studies would be required to determine whether any current model can achieve acceptable safety in actual practice.
The observed gap underscores the value of creating additional real-world validation cohorts for medical multimodal models.

Load-bearing premise

The retrospective multi-site hospital consultation cohort is representative of typical clinical dermatology practice and the chosen tasks of differential diagnosis generation and severity triage fully capture clinical decision-making needs.

What would settle it

A prospective deployment of the same models in live dermatology clinics that achieves diagnostic accuracy comparable to the models' public benchmark scores would falsify the claim that benchmarks substantially overestimate real-world capability.

Figures

Figures reproduced from arXiv: 2605.04098 by Ailish Hanly, Angela Sadlowski, Hyunjae Kim, Jeffrey Gehlhausen, Margaret MacGibeny, Morten Lee, Qingyu Chen, Roy Jiang, Shanin Chowdhury, Xuguang Ai, Zhenyue Qin.

**Figure 1.** Study design for evaluation of multimodal large language models (MLLMs) in consultative dermatology. Public image-only datasets included DermnetNZ, Fitzpatrick17k, and the SCIN dataset. These were compared with a real-world retrospective inpatient dermatology consultation cohort consisting of paired clinical images and clinician-authored contextual narratives. Here, n refers to image count, not case count.…

**Figure 2.** Top-3 diagnostic accuracy across datasets and models. Top-3 accuracy denotes the proportion of cases in which the reference diagnosis appeared among the top three model predictions. Accuracy was assessed using LLM-as-a-judge on a sampled subset of each dataset (n=2,000 per dataset); non-responses were excluded, and all valid responses were used in confidence interval calculations. Error bars indicate 95% W…

**Figure 3.** Sensitivity for severe diagnoses identification across public benchmark and hospital consult datasets by model. Panel A shows sensitivity on the public datasets DermNet, Fitz17k, and SCIN. Panel B shows sensitivity on the real-world hospital…

Original abstract

Multimodal large language models (MLLMs) have demonstrated promise on publicly available dermatology benchmarks. However, benchmark performance may not generalize to real-world dermatologic decision-making. To quantify this benchmark-to-bedside gap, we evaluated four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4 and MedGemma-4B-Instruct) and one commercial MLLM (GPT-4.1) across three publicly available dermatology datasets and a retrospective multi-site hospital-based dermatology consultation cohort comprising 5,811 cases and 46,405 clinical images. Models were evaluated on two clinically relevant tasks: differential diagnosis generation and severity-based triage. Diagnostic performance was modest on public datasets and declined substantially in the real-world cohort. On public benchmarks, top-3 diagnostic accuracy reached 26.55% for the best open-weight model and 42.25% for GPT-4.1. On real-world consultation cases using images alone, top-3 diagnostic accuracy fell to 1.50%-13.35% among open-weight models and 24.65% for GPT-4.1. Incorporating clinical context improved performance across all models, increasing top-3 diagnostic accuracy up to 28.75% among open-weight models and 38.93% for GPT-4.1. However, model outputs were highly sensitive to incomplete or erroneous consultation context. For severity-based triage, models achieved moderate sensitivity (above 60%), suggesting potential utility for screening but insufficient reliability for clinical deployment. These findings demonstrate that benchmark performance substantially overestimates the real-world clinical capability of current dermatology MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows a measurable drop in MLLM diagnostic accuracy from public benchmarks to a large hospital consultation cohort, but the cohort's likely bias toward complex cases limits how far the gap generalizes to everyday dermatology.


The main thing to know is that this work measures a real decline in top-3 diagnostic accuracy for GPT-4.1 and several open-weight MLLMs when they are tested on 5,811 real hospital consultation cases rather than public datasets. For the commercial model, accuracy on images alone falls from about 42% to about 25%, and the open-weight models score lower still. Adding clinical context lifts results somewhat but leaves them sensitive to incomplete inputs, and triage sensitivity stays only moderate. The scale of the real-world set and the consistent metrics across tasks give a clearer picture than most prior benchmark-only studies. The authors evaluate five models on both diagnosis generation and severity triage, which is a straightforward way to quantify the benchmark-to-bedside difference. The numbers are useful for anyone tracking how these systems might actually perform in clinics.

The soft spot is the cohort itself. It comes from multi-site hospital consultations, which tend to include more referred, ambiguous, or severe cases than routine outpatient or primary-care dermatology. The paper does not report case-complexity measures, referral stratification, or a comparator set of simpler cases, so part of the observed drop could reflect harder examples rather than a pure failure of benchmark generalization. That makes the claim that benchmarks substantially overestimate real-world capability narrower than it first appears.

This is worth reading for groups working on medical AI evaluation or dermatology applications. It supplies concrete data on current limits, even if the size of the gap needs more controls to interpret cleanly. I would send it for peer review so reviewers can push on the cohort selection and suggest any additional checks that are needed.

Referee Report

1 major / 2 minor

Summary. The manuscript evaluates four open-weight multimodal LLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, MedGemma-4B-Instruct) and GPT-4.1 on three public dermatology datasets plus a retrospective multi-site hospital consultation cohort of 5,811 cases and 46,405 images. Models are tested on differential diagnosis generation (top-k accuracy) and severity-based triage, showing modest public-benchmark performance that declines on the real-world cohort (e.g., GPT-4.1 top-3 accuracy falls from 42.25% to 24.65% with images alone, rising to 38.93% with context). The central claim is that public benchmarks substantially overestimate real-world clinical capability of current dermatology MLLMs.

Significance. If the findings hold after addressing representativeness, the work is significant because it supplies one of the largest real-world dermatology MLLM evaluations to date, with a 5,811-case multi-site cohort, 46,405 images, five models, and consistent top-k plus triage metrics across tasks. This provides concrete empirical evidence of a benchmark-to-bedside gap and supports calls for more realistic testing before clinical deployment.

major comments (1)

[Methods, Real-World Cohort Description] The claim that benchmark performance overestimates real-world capability is load-bearing on the assumption that the 5,811-case retrospective multi-site hospital consultation cohort reflects typical clinical dermatology practice. Hospital consultations are enriched for referred, diagnostically ambiguous, or severe cases; the observed drop (GPT-4.1 top-3 accuracy 42.25% public vs. 24.65% real-world images alone) could therefore reflect case-difficulty shift rather than benchmark failure. The manuscript reports no case-complexity metrics, referral-source stratification, or primary-care comparator cohort (Methods, Real-World Cohort Description).
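The case-difficulty argument can be illustrated with a case-mix-weighted accuracy. The sketch below is a toy illustration and not an analysis from the paper: the strata, the per-stratum accuracies, and both case mixes are invented, and `overall_accuracy` is a hypothetical helper. It holds model performance within each stratum fixed and changes only the mix.

```python
def overall_accuracy(
    stratum_accuracy: dict[str, float], case_mix: dict[str, float]
) -> float:
    """Case-mix-weighted accuracy: sum over strata of share * accuracy."""
    if abs(sum(case_mix.values()) - 1.0) > 1e-9:
        raise ValueError("case-mix shares must sum to 1")
    return sum(case_mix[s] * stratum_accuracy[s] for s in case_mix)


# All numbers below are invented for illustration; the paper reports no strata.
accuracy_by_difficulty = {"routine": 0.50, "ambiguous": 0.20}
benchmark_mix = {"routine": 0.75, "ambiguous": 0.25}
hospital_mix = {"routine": 0.25, "ambiguous": 0.75}

print(overall_accuracy(accuracy_by_difficulty, benchmark_mix))
print(overall_accuracy(accuracy_by_difficulty, hospital_mix))
```

With these invented inputs, overall accuracy falls from 42.5 percent to 27.5 percent even though the model is equally accurate within each stratum in both settings. Reporting per-stratum accuracy would let readers separate this effect from a genuine generalization failure.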

minor comments (2)

[Results, Triage Evaluation] The moderate sensitivity (>60%) reported for severity triage would be more informative if specificity, PPV, or full confusion matrices were also provided to assess overall clinical utility.
[Discussion] The noted sensitivity of outputs to incomplete or erroneous context is important; quantifying the performance drop under controlled missing-context simulations would strengthen the practical implications.
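The additional triage metrics requested in the first minor comment all follow from a 2x2 confusion matrix. A minimal sketch, assuming binary severe versus non-severe labels, is given here. `TriageCounts` and `triage_metrics` are hypothetical names, and the example counts are invented rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TriageCounts:
    tp: int  # severe cases flagged as severe
    fn: int  # severe cases missed
    fp: int  # non-severe cases flagged as severe
    tn: int  # non-severe cases correctly not flagged


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def triage_metrics(c: TriageCounts) -> dict[str, Optional[float]]:
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


# Invented counts for illustration only.
example = TriageCounts(tp=60, fn=40, fp=300, tn=600)
for name, value in triage_metrics(example).items():
    print(name, value)
```

With these invented counts, sensitivity is 60 percent while PPV is about 17 percent, which is why a sensitivity figure on its own says little about screening utility.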

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. The major comment raises an important point about the representativeness of our real-world cohort, which we address directly below. We agree this is a limitation and will revise the manuscript to discuss it explicitly.


Referee: The claim that benchmark performance overestimates real-world capability is load-bearing on the assumption that the 5,811-case retrospective multi-site hospital consultation cohort reflects typical clinical dermatology practice. Hospital consultations are enriched for referred, diagnostically ambiguous, or severe cases; the observed drop (GPT-4.1 top-3 accuracy 42.25% public vs. 24.65% real-world images alone) could therefore reflect case-difficulty shift rather than benchmark failure. The manuscript reports no case-complexity metrics, referral-source stratification, or primary-care comparator cohort (Methods, Real-World Cohort Description).

Authors: We agree that the retrospective hospital consultation cohort is likely enriched for more complex, referred, and diagnostically ambiguous cases relative to typical primary-care dermatology, and that this case-mix difference may partially explain the performance drop. The manuscript does not report case-complexity metrics, referral-source stratification, or a primary-care comparator, which limits our ability to fully isolate model capability from case difficulty. At the same time, hospital-based specialist consultations represent a core and high-stakes component of real-world dermatologic practice; the substantial gap between curated public benchmarks and this setting still demonstrates that current benchmarks overestimate performance where it matters most. In the revised manuscript we will add a paragraph to the Discussion explicitly acknowledging this limitation, discussing the potential contribution of case-difficulty shift, and recommending that future evaluations include primary-care cohorts for direct comparison.

Revision: yes

Circularity Check

0 steps flagged

No circularity: pure empirical evaluation on independent datasets


The paper conducts a straightforward empirical evaluation of MLLM diagnostic and triage performance on three public dermatology benchmarks versus a separate retrospective multi-site hospital cohort (5,811 cases, 46,405 images). Performance is measured directly against external ground-truth labels using standard top-k accuracy and sensitivity metrics. There are no equations, derivations, fitted parameters, predictions, ansatzes, or uniqueness theorems. No self-citations are used to justify load-bearing claims. The central finding (benchmark performance substantially overestimates real-world results) follows immediately from the reported accuracy drops and does not reduce to its own inputs by construction. The evaluation is therefore self-contained against external benchmarks.
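For readers unfamiliar with the top-k metric named above, a minimal sketch follows. It departs from the paper in two labeled ways. First, matching here is a case-insensitive exact string comparison, whereas the paper's figure caption says accuracy was assessed with LLM-as-a-judge. Second, the Wilson score interval is an assumption, because the caption is truncated and does not confirm which interval the authors used. All function names and the toy cases are invented.

```python
import math
from typing import Optional, Sequence


def top_k_hit(reference: str, predictions: Sequence[str], k: int = 3) -> bool:
    """True if the reference is among the first k predictions (exact match, case-insensitive)."""
    ref = reference.strip().lower()
    return any(p.strip().lower() == ref for p in predictions[:k])


def top_k_accuracy(
    cases: Sequence[tuple[str, Optional[Sequence[str]]]], k: int = 3
) -> tuple[int, int]:
    """Return (hits, valid_n). Cases whose predictions are None are non-responses and are excluded."""
    valid = [(ref, preds) for ref, preds in cases if preds is not None]
    hits = sum(top_k_hit(ref, preds, k) for ref, preds in valid)
    return hits, len(valid)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Toy cases, invented for illustration: (reference diagnosis, model predictions).
cases = [
    ("psoriasis", ["eczema", "psoriasis", "tinea"]),
    ("cellulitis", ["stasis dermatitis", "contact dermatitis", "eczema"]),
    ("scabies", None),  # non-response, excluded from the denominator
    ("Tinea corporis", ["tinea corporis"]),
]
hits, n = top_k_accuracy(cases)
low, high = wilson_interval(hits, n)
print(f"top-3 accuracy {hits}/{n}, interval ({low:.3f}, {high:.3f})")
```

Exact matching undercounts synonyms and spelling variants, which is presumably why a judge model is used in practice; the sketch only shows how the denominator and the interval are formed.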

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study relying on standard evaluation practices and existing model architectures without new theoretical constructs, free parameters, or invented entities.



