Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology
Pith reviewed 2026-05-09 19:56 UTC · model grok-4.3
The pith
Benchmark performance substantially overestimates the real-world clinical capability of current dermatology multimodal LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On public benchmarks, the best open-weight model reached 26.55 percent top-3 diagnostic accuracy and GPT-4.1 reached 42.25 percent. On the real-world cohort of 5,811 cases, using images alone, top-3 accuracy fell to between 1.50 and 13.35 percent for open-weight models and to 24.65 percent for GPT-4.1. Incorporating clinical context raised these figures to at most 28.75 percent for open-weight models and 38.93 percent for GPT-4.1, but model outputs remained highly sensitive to incomplete or erroneous context. Severity triage achieved moderate sensitivity, above 60 percent.
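The size of the gap can be restated as simple arithmetic on the GPT-4.1 figures quoted above. This is a minimal sketch that treats the quoted 42.25 percent as the public-benchmark reference point; the constant and function names are illustrative and not from the paper.

```python
# Top-3 accuracy figures for GPT-4.1 as quoted in the core claim (percent).
PUBLIC_TOP3 = 42.25        # public benchmarks
REAL_IMAGES_ONLY = 24.65   # real-world cohort, images alone
REAL_WITH_CONTEXT = 38.93  # real-world cohort, images plus clinical context


def gap(reference: float, observed: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = reference - observed
    relative = 100.0 * absolute / reference
    return absolute, relative


abs_img, rel_img = gap(PUBLIC_TOP3, REAL_IMAGES_ONLY)
abs_ctx, rel_ctx = gap(PUBLIC_TOP3, REAL_WITH_CONTEXT)

print(f"images alone: -{abs_img:.2f} points ({rel_img:.1f}% relative)")
print(f"with context: -{abs_ctx:.2f} points ({rel_ctx:.1f}% relative)")
```

With images alone the drop is 17.60 percentage points, or about 41.7 percent of the benchmark figure; with clinical context it narrows to 3.32 points, or about 7.9 percent.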
What carries the argument
Retrospective multi-site hospital dermatology consultation cohort of 5,811 cases and 46,405 images, evaluated on differential diagnosis generation and severity-based triage against public benchmark results.
If this is right
- Diagnostic accuracy declines substantially from public benchmarks to real-world hospital cases across all tested models.
- Incorporating clinical context improves performance but does not reach levels suitable for reliable clinical deployment.
- Model outputs are highly sensitive to incomplete or erroneous consultation context.
- Severity-based triage shows moderate sensitivity, suggesting possible utility for screening but not for standalone clinical decisions.
Where Pith is reading between the lines
- Development of future dermatology MLLMs should emphasize robustness to partial or noisy clinical information.
- Prospective clinical studies would be required to determine whether any current model can achieve acceptable safety in actual practice.
- The observed gap underscores the value of creating additional real-world validation cohorts for medical multimodal models.
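One way to probe robustness to partial clinical information is a controlled ablation that randomly withholds fields from each consultation record before prompting. The sketch below is an illustration under stated assumptions: the field names (`history`, `location`, `duration`), the `drop_prob` parameter, and the `ablate_context` function are all hypothetical and do not come from the paper.

```python
import random


def ablate_context(context: dict[str, str], drop_prob: float,
                   rng: random.Random) -> dict[str, str]:
    """Return a copy of `context` with each field independently withheld
    with probability `drop_prob`."""
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be in [0, 1]")
    return {k: v for k, v in context.items() if rng.random() >= drop_prob}


# Hypothetical consultation record; real field names would come from the cohort.
record = {
    "history": "itchy rash for two weeks",
    "location": "forearm",
    "duration": "14 days",
}

rng = random.Random(0)  # fixed seed so the ablation is reproducible
for p in (0.0, 0.5, 1.0):
    kept = ablate_context(record, p, rng)
    print(p, sorted(kept))
```

Sweeping `drop_prob` from 0 to 1 and re-scoring the model at each level would give a degradation curve rather than a single with-context versus without-context comparison.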
Load-bearing premise
The retrospective multi-site hospital consultation cohort is representative of typical clinical dermatology practice and the chosen tasks of differential diagnosis generation and severity triage fully capture clinical decision-making needs.
What would settle it
A prospective deployment of the same models in live dermatology clinics that achieves diagnostic accuracy comparable to the models' public benchmark scores would falsify the claim that benchmarks substantially overestimate real-world capability.
Original abstract
Multimodal large language models (MLLMs) have demonstrated promise on publicly available dermatology benchmarks. However, benchmark performance may not generalize to real-world dermatologic decision-making. To quantify this benchmark-to-bedside gap, we evaluated four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4 and MedGemma-4B-Instruct) and one commercial MLLM (GPT-4.1) across three publicly available dermatology datasets and a retrospective multi-site hospital-based dermatology consultation cohort comprising 5,811 cases and 46,405 clinical images. Models were evaluated on two clinically relevant tasks: differential diagnosis generation and severity-based triage. Diagnostic performance was modest on public datasets and declined substantially in the real-world cohort. On public benchmarks, top-3 diagnostic accuracy reached 26.55% for the best open-weight model and 42.25% for GPT-4.1. On real-world consultation cases using images alone, top-3 diagnostic accuracy fell to 1.50%-13.35% among open-weight models and 24.65% for GPT-4.1. Incorporating clinical context improved performance across all models, increasing top-3 diagnostic accuracy up to 28.75% among open-weight models and 38.93% for GPT-4.1. However, model outputs were highly sensitive to incomplete or erroneous consultation context. For severity-based triage, models achieved moderate sensitivity (above 60%), suggesting potential utility for screening but insufficient reliability for clinical deployment. These findings demonstrate that benchmark performance substantially overestimates the real-world clinical capability of current dermatology MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates four open-weight multimodal LLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, MedGemma-4B-Instruct) and GPT-4.1 on three public dermatology datasets plus a retrospective multi-site hospital consultation cohort of 5,811 cases and 46,405 images. Models are tested on differential diagnosis generation (top-k accuracy) and severity-based triage, showing modest public-benchmark performance that declines on the real-world cohort (e.g., GPT-4.1 top-3 accuracy falls from 42.25% to 24.65% with images alone, rising to 38.93% with context). The central claim is that public benchmarks substantially overestimate real-world clinical capability of current dermatology MLLMs.
Significance. If the findings hold after addressing representativeness, the work is significant because it supplies one of the largest real-world dermatology MLLM evaluations to date, with a 5,811-case multi-site cohort, 46,405 images, five models, and consistent top-k plus triage metrics across tasks. This provides concrete empirical evidence of a benchmark-to-bedside gap and supports calls for more realistic testing before clinical deployment.
major comments (1)
- [Methods, Real-World Cohort Description] The claim that benchmark performance overestimates real-world capability rests on the assumption that the 5,811-case retrospective multi-site hospital consultation cohort reflects typical clinical dermatology practice. Hospital consultations are enriched for referred, diagnostically ambiguous, or severe cases; the observed drop (GPT-4.1 top-3 accuracy 42.25% public vs. 24.65% real-world images alone) could therefore reflect case-difficulty shift rather than benchmark failure. The manuscript reports no case-complexity metrics, referral-source stratification, or primary-care comparator cohort.
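The stratification requested in this comment could look like the following sketch, which reports top-3 accuracy separately per stratum so that a case-mix shift becomes visible. The `Case` fields, the stratum labels, and the toy records are hypothetical; the manuscript reports no such metadata.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Case:
    stratum: str    # hypothetical label, e.g. referral source
    top3_hit: bool  # True if the reference diagnosis was in the top three


def accuracy_by_stratum(cases: list[Case]) -> dict[str, float]:
    """Top-3 accuracy computed separately within each stratum."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case.stratum] += 1
        hits[case.stratum] += int(case.top3_hit)
    return {s: hits[s] / totals[s] for s in totals}


# Toy records for illustration only; not data from the paper.
toy = [
    Case("inpatient_consult", False),
    Case("inpatient_consult", True),
    Case("outpatient_referral", True),
    Case("outpatient_referral", True),
]
print(accuracy_by_stratum(toy))
```

If accuracy within the easier strata approached benchmark levels, the overall drop would point to case difficulty; if it stayed low in every stratum, the benchmark-overestimation reading would be strengthened.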
minor comments (2)
- [Results, Triage Evaluation] The moderate sensitivity (>60%) reported for severity triage would be more informative if specificity, PPV, or full confusion matrices were also provided to assess overall clinical utility.
- [Discussion] The noted sensitivity of outputs to incomplete or erroneous context is important; quantifying the performance drop under controlled missing-context simulations would strengthen the practical implications.
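The additional triage metrics requested in the first minor comment all follow from a 2x2 confusion matrix. A minimal sketch, using hypothetical counts chosen only so that sensitivity exceeds 60 percent; none of these numbers are from the paper.

```python
def triage_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Sensitivity, specificity, and PPV from binary confusion-matrix counts.

    "Positive" means the case was flagged as severe.
    """
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("each denominator needs at least one case")
    return {
        "sensitivity": tp / (tp + fn),  # severe cases correctly flagged
        "specificity": tn / (tn + fp),  # non-severe cases correctly cleared
        "ppv": tp / (tp + fp),          # flagged cases that are truly severe
    }


# Hypothetical counts for illustration only.
m = triage_metrics(tp=65, fp=120, fn=35, tn=780)
print({k: round(v, 3) for k, v in m.items()})
```

With these toy counts a sensitivity of 0.65 coexists with a PPV of about 0.35, which illustrates why sensitivity alone says little about clinical utility.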
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. The major comment raises an important point about the representativeness of our real-world cohort, which we address directly below. We agree this is a limitation and will revise the manuscript to discuss it explicitly.
Point-by-point responses
Referee: The claim that benchmark performance overestimates real-world capability rests on the assumption that the 5,811-case retrospective multi-site hospital consultation cohort reflects typical clinical dermatology practice. Hospital consultations are enriched for referred, diagnostically ambiguous, or severe cases; the observed drop (GPT-4.1 top-3 accuracy 42.25% public vs. 24.65% real-world images alone) could therefore reflect case-difficulty shift rather than benchmark failure. The manuscript reports no case-complexity metrics, referral-source stratification, or primary-care comparator cohort (Methods, Real-World Cohort Description).
Authors: We agree that the retrospective hospital consultation cohort is likely enriched for more complex, referred, and diagnostically ambiguous cases relative to typical primary-care dermatology, and that this case-mix difference may partially explain the performance drop. The manuscript does not report case-complexity metrics, referral-source stratification, or a primary-care comparator, which limits our ability to fully isolate model capability from case difficulty. At the same time, hospital-based specialist consultations represent a core and high-stakes component of real-world dermatologic practice; the substantial gap between curated public benchmarks and this setting still demonstrates that current benchmarks overestimate performance where it matters most. In the revised manuscript we will add a paragraph to the Discussion explicitly acknowledging this limitation, discussing the potential contribution of case-difficulty shift, and recommending that future evaluations include primary-care cohorts for direct comparison.
Revision: yes
Circularity Check
No circularity: pure empirical evaluation on independent datasets
The paper conducts a straightforward empirical evaluation of MLLM diagnostic and triage performance on three public dermatology benchmarks versus a separate retrospective multi-site hospital cohort (5,811 cases, 46,405 images). Performance is measured directly against external ground-truth labels using standard top-k accuracy and sensitivity metrics. There are no equations, derivations, fitted parameters, predictions, ansatzes, or uniqueness theorems. No self-citations are used to justify load-bearing claims. The central finding (benchmark performance substantially overestimates real-world results) follows immediately from the reported accuracy drops and does not reduce to its own inputs by construction. The evaluation is therefore self-contained against external benchmarks.
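For reference, top-k accuracy is the fraction of cases whose reference diagnosis appears among the model's first k predictions. The sketch below matches labels by exact string comparison after normalization, which is a simplifying stand-in and not the paper's matching procedure; the function name and toy differentials are illustrative.

```python
def top_k_accuracy(predictions: list[list[str]], references: list[str],
                   k: int = 3) -> float:
    """Fraction of cases whose reference label is among the first k predictions."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    if not references:
        raise ValueError("need at least one case")

    def norm(label: str) -> str:
        return label.strip().lower()

    hits = sum(
        norm(ref) in {norm(p) for p in preds[:k]}
        for preds, ref in zip(predictions, references)
    )
    return hits / len(references)


# Toy differentials for illustration only; not data from the paper.
preds = [
    ["psoriasis", "eczema", "tinea corporis"],
    ["melanoma", "seborrheic keratosis", "nevus"],
]
refs = ["Eczema", "basal cell carcinoma"]
print(top_k_accuracy(preds, refs, k=3))
```

Free-text diagnoses rarely match a reference label verbatim, so a real evaluation would need synonym handling or a judged match in place of `norm`.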