Assessing the Generalizability of Deep Neural Networks-Based Models for Black Skin Lesions
Pith reviewed 2026-05-24 06:27 UTC · model grok-4.3
The pith
Deep neural network models for skin lesion diagnosis perform poorly on black skin lesions from acral regions compared to white skin.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that deep neural network models for skin lesion classification, which are trained mostly on datasets of white skin tones, exhibit poor generalizability to acral skin lesions typical in black patients. When tested on a carefully curated acral dataset stratified by the Fitzpatrick scale, the models deliver favorable performance only for lesions on white skin.
What carries the argument
Performance evaluation of supervised and self-supervised models on a curated dataset of acral skin lesions assessed via the Fitzpatrick skin tone scale.
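As an illustration of what a skin-tone-stratified evaluation involves, the sketch below computes AUC separately per Fitzpatrick bucket. It is not the authors' code: the record layout `(fitzpatrick_type, label, score)`, the bucket boundaries (I-II, III-IV, V-VI), and the toy numbers are all assumptions made for the example.

```python
from collections import defaultdict

def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic; tied pairs count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined unless both classes are present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_group(records):
    """records: iterable of (fitzpatrick_type 1..6, label 0/1, model score).

    The three-bucket grouping is an assumption for this sketch.
    """
    groups = defaultdict(lambda: ([], []))
    for fst, label, score in records:
        bucket = "I-II" if fst <= 2 else "III-IV" if fst <= 4 else "V-VI"
        groups[bucket][0].append(label)
        groups[bucket][1].append(score)
    return {g: auc(ys, ss) for g, (ys, ss) in groups.items()}

# Hypothetical toy predictions (1 = melanoma, 0 = benign).
records = [
    (1, 1, 0.9), (1, 0, 0.2), (2, 1, 0.8), (2, 0, 0.3),
    (5, 1, 0.4), (5, 0, 0.6), (6, 1, 0.7), (6, 0, 0.5),
]
per_group = auc_by_group(records)
```

Reporting one number per bucket, rather than a single pooled score, is what makes a gap between lighter and darker skin tones visible at all.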
If this is right
- Diverse datasets covering multiple skin tones are required for equitable diagnostic performance.
- Specialized models may need to be developed for accurate detection of acral lesions on black skin.
- Without such inclusion, AI tools cannot deliver benefits to populations with limited access to dermatology.
- Neglecting black skin lesions in dataset creation prevents responsible use of these technologies.
Where Pith is reading between the lines
- Data biases in training sets could translate into unequal real-world melanoma detection rates across skin tones.
- Techniques such as targeted data collection or adaptation methods might mitigate the observed gaps.
- Repeating the evaluation on additional independent datasets would help isolate skin tone as the causal factor.
Load-bearing premise
That observed performance gaps stem from skin tone differences rather than confounding factors such as image quality, lesion subtype distribution, or image acquisition site.
What would settle it
Finding equivalent accuracy on a new, large collection of black acral lesions after matching for image quality, subtype mix, and acquisition conditions would falsify the claim.
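One simple way to approximate such matching is exact matching on strata: keep only combinations of confounder values that appear in both groups, and subsample each to the same size. The sketch below assumes each image is a dict with hypothetical metadata fields such as `subtype` and `device`; the source does not say which metadata are actually available.

```python
import random
from collections import defaultdict

def match_on_strata(group_a, group_b, keys, seed=0):
    """Subsample two groups so each shared stratum has equal counts.

    group_a, group_b: lists of dicts; keys: confounder field names
    (hypothetical here, e.g. "subtype", "device").
    """
    rng = random.Random(seed)

    def by_stratum(rows):
        strata = defaultdict(list)
        for row in rows:
            strata[tuple(row[k] for k in keys)].append(row)
        return strata

    strata_a, strata_b = by_stratum(group_a), by_stratum(group_b)
    matched_a, matched_b = [], []
    # Strata present in only one group are dropped entirely.
    for stratum in sorted(strata_a.keys() & strata_b.keys()):
        n = min(len(strata_a[stratum]), len(strata_b[stratum]))
        matched_a += rng.sample(strata_a[stratum], n)
        matched_b += rng.sample(strata_b[stratum], n)
    return matched_a, matched_b

# Hypothetical metadata for illustration only.
white = [{"subtype": "melanoma", "device": "d1"}] * 3 + [{"subtype": "nevus", "device": "d1"}]
black = [{"subtype": "melanoma", "device": "d1"}] * 2 + [{"subtype": "nevus", "device": "d2"}] * 2
matched_white, matched_black = match_on_strata(white, black, ["subtype", "device"])
```

Exact matching discards data and only controls for the fields listed, so a gap that survives it is stronger evidence for a skin-tone effect but still not proof.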
Original abstract
Melanoma is the most severe type of skin cancer due to its ability to cause metastasis. It is more common in black people, often affecting acral regions: palms, soles, and nails. Deep neural networks have shown tremendous potential for improving clinical care and skin cancer diagnosis. Nevertheless, prevailing studies predominantly rely on datasets of white skin tones, neglecting to report diagnostic outcomes for diverse patient skin tones. In this work, we evaluate supervised and self-supervised models in skin lesion images extracted from acral regions commonly observed in black individuals. Also, we carefully curate a dataset containing skin lesions in acral regions and assess the datasets concerning the Fitzpatrick scale to verify performance on black skin. Our results expose the poor generalizability of these models, revealing their favorable performance for lesions on white skin. Neglecting to create diverse datasets, which necessitates the development of specialized models, is unacceptable. Deep neural networks have great potential to improve diagnosis, particularly for populations with limited access to dermatology. However, including black skin lesions is necessary to ensure these populations can access the benefits of inclusive technology.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates supervised and self-supervised deep neural network models on skin lesion images extracted from acral regions, curates a new dataset of such lesions assessed on the Fitzpatrick scale to represent black skin, and concludes that the models exhibit poor generalizability to black skin lesions while performing favorably on white skin.
Significance. If the performance gap can be rigorously attributed to skin tone after controlling for confounders, the result would underscore an important fairness limitation in dermatological AI and support calls for more inclusive datasets.
major comments (2)
- [Dataset curation and evaluation] Dataset curation section: the manuscript provides no indication that white-skin comparison sets (e.g., ISIC-derived) were matched or stratified on lesion subtype distribution, image resolution/quality, or acquisition site/device; without such controls the attribution of lower performance to skin tone rather than these confounders cannot be established.
- [Results] Results and abstract: the central claim of 'poor generalizability' is stated without any reported quantitative metrics (accuracy, AUC, etc.), confidence intervals, dataset sizes, or statistical tests comparing the acral black-skin set to the white-skin baseline, so the data-to-claim link cannot be evaluated.
minor comments (1)
- [Abstract] Abstract: key numerical results and dataset sizes should be included to allow readers to assess the magnitude of the reported performance differences.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important aspects of dataset controls and quantitative reporting that we will address in the revision. Below we respond point-by-point to the major comments.
-
Referee: [Dataset curation and evaluation] Dataset curation section: the manuscript provides no indication that white-skin comparison sets (e.g., ISIC-derived) were matched or stratified on lesion subtype distribution, image resolution/quality, or acquisition site/device; without such controls the attribution of lower performance to skin tone rather than these confounders cannot be established.
Authors: We agree that explicit matching or stratification on lesion subtype, image quality, and acquisition factors would strengthen causal attribution to skin tone. The current manuscript focuses on curating and evaluating a new acral black-skin dataset against standard white-skin benchmarks (ISIC-derived) without describing such controls. In the revised version we will expand the Dataset Curation section to report all available metadata on the comparison sets, note any limitations in matching, and add a discussion of potential confounders. Where data permit, we will also include supplementary analyses that stratify or match on subtype distribution.
revision: yes
-
Referee: [Results] Results and abstract: the central claim of 'poor generalizability' is stated without any reported quantitative metrics (accuracy, AUC, etc.), confidence intervals, dataset sizes, or statistical tests comparing the acral black-skin set to the white-skin baseline, so the data-to-claim link cannot be evaluated.
Authors: The full manuscript contains experimental results, yet we acknowledge that the abstract and Results section do not present the quantitative metrics, confidence intervals, dataset sizes, or statistical comparisons in sufficient detail. In the revision we will update the abstract with key performance numbers and ensure the Results section includes all accuracy/AUC values, confidence intervals, sample sizes, and appropriate statistical tests for the black-skin versus white-skin comparisons.
revision: yes
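To make the requested reporting concrete, the following sketch shows one common way to attach a confidence interval to an AUC gap between two test sets: a percentile bootstrap. The function names, the `(label, score)` row layout, and the toy scores are assumptions for illustration; the manuscript does not state which interval or test the authors will use.

```python
import random

def auc(rows):
    """AUC from (label, score) rows via Mann-Whitney U; ties count 0.5."""
    pos = [s for y, s in rows if y == 1]
    neg = [s for y, s in rows if y == 0]
    if not pos or not neg:
        return None  # undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_gap(white, black, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for AUC(white) - AUC(black).

    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)

    def resampled_auc(rows):
        # Resample with replacement; redraw if one class is missing.
        while True:
            sample = [rng.choice(rows) for _ in rows]
            value = auc(sample)
            if value is not None:
                return value

    gaps = sorted(resampled_auc(white) - resampled_auc(black)
                  for _ in range(n_boot))
    lower = gaps[int((alpha / 2) * n_boot)]
    upper = gaps[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return auc(white) - auc(black), lower, upper

# Hypothetical toy scores (1 = melanoma, 0 = benign).
white = [(1, 0.9), (1, 0.8), (0, 0.2), (0, 0.3)]
black = [(1, 0.4), (1, 0.7), (0, 0.6), (0, 0.5)]
point, lower, upper = bootstrap_auc_gap(white, black, n_boot=200)
```

An interval that excludes zero suggests the gap is not just sampling noise, although with test sets as small as curated acral collections tend to be, the interval may be wide.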
Circularity Check
Empirical evaluation with no derivation chain or fitted predictions
This is a dataset curation and model evaluation study. The central claim (poor generalizability to black/acral lesions) rests on direct performance measurements across datasets, not on any equation, parameter fit, or self-citation that reduces the result to its own inputs. No equations, uniqueness theorems, or ansatzes appear in the provided text. The skeptic concern about confounders is a validity issue, not a circularity issue. Score 0 is the appropriate finding for an honest empirical paper.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The Fitzpatrick scale provides a sufficient proxy for skin-tone-related appearance variation in lesion images.
- Domain assumption: The selected supervised and self-supervised models are representative of current practice in skin-lesion classification.
Reference graph
Works this paper leans on
- [1] American Cancer Society. Key statistics for melanoma skin cancer. https://www.cancer.org/cancer/melanoma-skin-cancer/about/key-statistics.html, 2022
- [2] AIM at Melanoma Foundation. What is acral lentiginous melanoma? https://www.aimatmelanoma.org/melanoma-101/types-of-melanoma/cutaneous-melanoma/acral-lentiginous-melanoma/
- [3] Memorial Sloan Kettering Cancer Center. Types of melanoma. https://www.mskcc.org/cancer-care/types/melanoma/types-melanoma, 2022
- [4] Yara Alves Caetano, Ana Maria Quinteiro Ribeiro, Bruno Ricardo da Silva Albernaz, Isabella de Paula Eleutério, and Luiz Fernando Fleury Fróes. Melanoma acral-estudo clínico e epidemiológico. Surgical & Cosmetic Dermatology, 2020
- [5] Roni Caryn Rabin. Dermatology has a problem with skin color. https://www-nytimes-com.cdn.ampproject.org/c/s/www.nytimes.com/2020/08/30/health/skin-diseases-black-hispanic.amp.html, 2020
- [6] Afonso Menegola, Michel Fornaciali, Ramon Pires, Flávia Vasques Bittencourt, Sandra Avila, and Eduardo Valle. Knowledge transfer for melanoma screening with deep learning. International Symposium on Biomedical Imaging, 2017
- [7] Levy Chaves, Alceu Bissoto, Eduardo Valle, and Sandra Avila. An evaluation of self-supervised pre-training for skin-lesion analysis. In European Conference on Computer Vision Workshops, 2022
- [8] Neil Singh. Decolonising dermatology: why black and brown skin need better treatment. The Guardian, 13, 2020
- [9] DermNet. Fitzpatrick skin phototype. https://dermnetnz.org/topics/skin-phototype, 2012
- [10] Dermatology Learning Network. Skin cancer in african-americans. https://www.hmpgloballearningnetwork.com/site/thederm/article/2547, 2004
- [11] Matthew Groh, Caleb Harris, Luis Soenksen, Felix Lau, Rachel Han, Aerin Kim, Arash Koochek, and Omar Badri. Evaluating deep neural networks trained on clinical images in dermatology with the fitzpatrick 17k dataset. In Conference on Computer Vision and Pattern Recognition, 2021
- [12] Chanki Yu, Sejung Yang, Wonoh Kim, Jinwoong Jung, Kee-Yang Chung, Sang Wook Lee, and Byungho Oh. Acral melanoma detection using a convolutional neural network for dermoscopy images. PloS one, 2018
- [13] S Lee, YS Chu, SK Yoo, S Choi, SJ Choe, SB Koh, KY Chung, L Xing, B Oh, and S Yang. Augmented decision-making for acral lentiginous melanoma detection using deep convolutional neural networks. Journal of the European Academy of Dermatology and Venereology, 2020. 14 L. Barros et al
- [14] Qaiser Abbas, Farheen Ramzan, and Muhammad Usman Ghani. Acral melanoma detection using dermoscopic images and convolutional neural networks. Visual Computing for Industry, Biomedicine, and Art, 2021
- [15] Neda Alipour, Ted Burke, and Jane Courtney. Skin type diversity: a case study in skin lesion datasets. 2023
- [16] Andre Pacheco, Gustavo Lima, Amanda Salomão, Breno Krohling, Igor Biral, Gabriel Angelo, Fábio Jr, José Esgario, Alana Simora, Pedro Castro, Felipe Rodrigues, Patricia Frasson, et al. PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones. Data in Brief, 2020
- [17] Roxana Daneshjou, Kailas Vodrahalli, Roberto A. Novoa, Melissa Jenkins, Weixin Liang, Veronica Rotemberg, Justin Ko, et al. Disparities in dermatology ai performance on a diverse, curated clinical image set. Science Advances, 2022
- [18] Peter J Bevan and Amir Atapour-Abarghouei. Detecting melanoma fairly: Skin tone detection and debiasing for skin lesion classification. In MICCAI Workshop on Domain Adaptation and Representation Transfer, pages 1–11, 2022
- [19] Arezou Pakzad, Kumar Abhishek, and Ghassan Hamarneh. Circle: Color invariant representation learning for unbiased classification of skin lesions. In European Conference on Computer Vision, 2022
- [20] Eman Rezk, Mohamed Eltorki, Wael El-Dakhakhni, et al. Improving skin color diversity in cancer detection: deep learning approach. JMIR Dermatology, 5(3):e39143
- [21] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, 2009
- [22]
- [23] Eduardo Valle, Michel Fornaciali, Afonso Menegola, Julia Tavares, Flávia Vasques Bittencourt, Lin Tzy Li, and Sandra Avila. Data, depth, and design: Learning reliable models for skin lesion analysis. Neurocomputing, 2020
- [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016
- [25] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, et al. Bootstrap your own latent - a new approach to self-supervised learning. In Advances in Neural Information Processing Systems, 2020
- [26] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? In Advances in Neural Information Processing Systems, 2020
- [27] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Conference on Computer Vision and Pattern Recognition, 2020
- [28] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 2020
- [29] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 2020
- [30] Jeremy Kawahara, Sara Daneshvar, Giuseppe Argenziano, and Ghassan Hamarneh. Seven-point checklist and skin lesion classification using multitask multimodal neural nets. IEEE Journal of Biomedical and Health Informatics, 2019
- [31] Richard P. Usatine and Brian D. Madden. Interactive dermatology atlas. https://www.dermatlas.net, 2023
- [32] Dermis.net: Dermatology information service available on the internet. https://www.dermis.net/dermisroot/pt/home/index.htm, 2023
- [33]
- [34] Giuseppe Argenziano, Gabriella Fabbrocini, Paolo Carli, Vincenzo De Giorgi, Elena Sammarco, and Mario Delfino. Epiluminescence microscopy for the diagnosis of doubtful melanocytic skin lesions: comparison of the abcd rule of dermatoscopy and a new 7-point checklist based on pattern analysis. Archives of dermatology, 134(12):1563–1570, 1998
- [35]
- [36] Samuel Freire da Silva. Atlas dermatologico. http://atlasdermatologico.com.br.
Appendix A
In this section, we stratified the results of Table 4 by Fitzpatrick scale. Tables A1, A2, and A3 show the results for the DDI, Fitzpatrick 17k, and PAD-UFES-20* datasets, respectively.
Table A1: Evaluation metrics for DDI dataset. #Mel and #Ben ind...