
arxiv: 2310.00517 · v2 · submitted 2023-09-30 · 💻 cs.CV

Assessing the Generalizability of Deep Neural Networks-Based Models for Black Skin Lesions

Pith reviewed 2026-05-24 06:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords skin lesion classification · deep neural networks · generalizability · skin tone bias · acral lesions · Fitzpatrick scale · melanoma detection · medical imaging

The pith

Deep neural network models for skin lesion diagnosis perform poorly on black skin lesions from acral regions compared to white skin.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates supervised and self-supervised deep neural network models on skin lesion images from acral regions such as palms, soles, and nails. These regions are common sites for melanoma in black individuals. The authors curate a dedicated dataset and assess model outcomes using the Fitzpatrick skin tone scale. Results show the models generalize poorly overall and perform better on white skin lesions. A sympathetic reader would care because such tools could aid diagnosis in areas with limited dermatology access, yet only if they work across skin tones.

Core claim

The central claim is that deep neural network models for skin lesion classification, which are trained mostly on datasets of white skin tones, exhibit poor generalizability to acral skin lesions typical in black patients. When tested on a carefully curated acral dataset stratified by the Fitzpatrick scale, the models deliver favorable performance only for lesions on white skin.

What carries the argument

Performance evaluation of supervised and self-supervised models on a curated dataset of acral skin lesions assessed via the Fitzpatrick skin tone scale.
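
A minimal sketch of what such a stratified evaluation might look like, assuming each test image carries a Fitzpatrick type, a binary melanoma label, and a model score. The function names and the toy records are hypothetical and not taken from the paper; AUC is computed here as the pairwise win rate of positives over negatives.

```python
from collections import defaultdict

def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when one class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_group(records):
    """records: iterable of (fitzpatrick_type, label, score) tuples."""
    groups = defaultdict(lambda: ([], []))
    for ftype, label, score in records:
        groups[ftype][0].append(label)
        groups[ftype][1].append(score)
    return {g: auc(ys, ss) for g, (ys, ss) in sorted(groups.items())}

# Made-up toy records for illustration only: (Fitzpatrick type, label, score).
records = [
    (1, 1, 0.9), (1, 0, 0.2), (1, 1, 0.8), (1, 0, 0.4),
    (6, 1, 0.6), (6, 0, 0.7), (6, 1, 0.5), (6, 0, 0.3),
]
per_group = auc_by_group(records)
print(per_group)
```

Reporting one number per Fitzpatrick type, rather than a single pooled score, is what makes a gap between skin tones visible at all.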

If this is right

  • Diverse datasets covering multiple skin tones are required for equitable diagnostic performance.
  • Specialized models may need to be developed for accurate detection of acral lesions on black skin.
  • Without such inclusion, AI tools cannot deliver benefits to populations with limited access to dermatology.
  • Neglecting black skin lesions in dataset creation prevents responsible use of these technologies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Data biases in training sets could translate into unequal real-world melanoma detection rates across skin tones.
  • Techniques such as targeted data collection or adaptation methods might mitigate the observed gaps.
  • Repeating the evaluation on additional independent datasets would help isolate skin tone as the causal factor.

Load-bearing premise

That observed performance gaps stem from skin tone differences rather than confounding factors such as image quality, lesion subtype distribution, or image acquisition site.

What would settle it

Finding equivalent accuracy on a new, large collection of black acral lesions after matching for image quality, subtype mix, and acquisition conditions would falsify the claim.
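
One way such matching could be set up is exact matching on strata, sketched below. This assumes each image record exposes metadata for the confounders of interest; the field names (`subtype`, `device`), the helper `match_on_strata`, and the toy records are hypothetical, and nothing here reflects the authors' procedure.

```python
from collections import defaultdict

def match_on_strata(black_set, white_set, key):
    """Keep, within each stratum, equally many images from both sets."""
    by_stratum = defaultdict(lambda: {"black": [], "white": []})
    for item in black_set:
        by_stratum[key(item)]["black"].append(item)
    for item in white_set:
        by_stratum[key(item)]["white"].append(item)
    matched_black, matched_white = [], []
    for stratum in sorted(by_stratum):
        b = by_stratum[stratum]["black"]
        w = by_stratum[stratum]["white"]
        k = min(len(b), len(w))  # strata present in only one set are dropped
        matched_black.extend(b[:k])
        matched_white.extend(w[:k])
    return matched_black, matched_white

# Made-up toy metadata for illustration only.
black_set = [
    {"id": "b1", "subtype": "melanoma", "device": "dermoscope"},
    {"id": "b2", "subtype": "nevus", "device": "smartphone"},
]
white_set = [
    {"id": "w1", "subtype": "melanoma", "device": "dermoscope"},
    {"id": "w2", "subtype": "melanoma", "device": "dermoscope"},
    {"id": "w3", "subtype": "nevus", "device": "dermoscope"},
]
mb, mw = match_on_strata(
    black_set, white_set, key=lambda r: (r["subtype"], r["device"])
)
print([r["id"] for r in mb], [r["id"] for r in mw])
```

Exact matching discards unmatched images, so it trades sample size for comparability; with small acral datasets that trade-off may be severe.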

Figures

Figures reproduced from arXiv: 2310.00517 by Levy Chaves, Luana Barros, Sandra Avila.

Figure 1: The Fitzpatrick skin type scale. (a) Type 1 (light): pale skin, always …
Figure 2: Each image corresponds to a melanoma sample and is associated with a …
Figure 3: Evaluation pipeline for all models. Given a test image, we adopt the final …
Original abstract

Melanoma is the most severe type of skin cancer due to its ability to cause metastasis. It is more common in black people, often affecting acral regions: palms, soles, and nails. Deep neural networks have shown tremendous potential for improving clinical care and skin cancer diagnosis. Nevertheless, prevailing studies predominantly rely on datasets of white skin tones, neglecting to report diagnostic outcomes for diverse patient skin tones. In this work, we evaluate supervised and self-supervised models in skin lesion images extracted from acral regions commonly observed in black individuals. Also, we carefully curate a dataset containing skin lesions in acral regions and assess the datasets concerning the Fitzpatrick scale to verify performance on black skin. Our results expose the poor generalizability of these models, revealing their favorable performance for lesions on white skin. Neglecting to create diverse datasets, which necessitates the development of specialized models, is unacceptable. Deep neural networks have great potential to improve diagnosis, particularly for populations with limited access to dermatology. However, including black skin lesions is necessary to ensure these populations can access the benefits of inclusive technology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates supervised and self-supervised deep neural network models on skin lesion images extracted from acral regions, curates a new dataset of such lesions assessed on the Fitzpatrick scale to represent black skin, and concludes that the models exhibit poor generalizability to black skin lesions while performing favorably on white skin.

Significance. If the performance gap can be rigorously attributed to skin tone after controlling for confounders, the result would underscore an important fairness limitation in dermatological AI and support calls for more inclusive datasets.

major comments (2)
  1. [Dataset curation and evaluation] Dataset curation section: the manuscript provides no indication that white-skin comparison sets (e.g., ISIC-derived) were matched or stratified on lesion subtype distribution, image resolution/quality, or acquisition site/device; without such controls the attribution of lower performance to skin tone rather than these confounders cannot be established.
  2. [Results] Results and abstract: the central claim of 'poor generalizability' is stated without any reported quantitative metrics (accuracy, AUC, etc.), confidence intervals, dataset sizes, or statistical tests comparing the acral black-skin set to the white-skin baseline, so the data-to-claim link cannot be evaluated.
minor comments (1)
  1. [Abstract] Abstract: key numerical results and dataset sizes should be included to allow readers to assess the magnitude of the reported performance differences.
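
The kind of interval the referee asks for could be obtained with a percentile bootstrap, sketched here under stated assumptions: per-image correctness flags are available for both test sets, and accuracy is the metric of interest. The function name `bootstrap_diff_ci` and the toy counts are hypothetical and are not results from the paper.

```python
import random

def bootstrap_diff_ci(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(a) - accuracy(b).

    correct_a, correct_b: lists of 0/1 flags (1 = prediction was correct).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    diffs = []
    for _ in range(n_boot):
        a = rng.choices(correct_a, k=len(correct_a))  # resample with replacement
        b = rng.choices(correct_b, k=len(correct_b))
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Made-up toy outcomes: 18/20 correct on one set, 11/20 on the other.
white_correct = [1] * 18 + [0] * 2
black_correct = [1] * 11 + [0] * 9
point_diff = sum(white_correct) / 20 - sum(black_correct) / 20
lo, hi = bootstrap_diff_ci(white_correct, black_correct)
print(f"accuracy gap {point_diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

An interval that excludes zero would support a real gap between the two sets, though it would say nothing about whether skin tone or a confounder causes it.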

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of dataset controls and quantitative reporting that we will address in the revision. Below we respond point-by-point to the major comments.

  1. Referee: [Dataset curation and evaluation] Dataset curation section: the manuscript provides no indication that white-skin comparison sets (e.g., ISIC-derived) were matched or stratified on lesion subtype distribution, image resolution/quality, or acquisition site/device; without such controls the attribution of lower performance to skin tone rather than these confounders cannot be established.

    Authors: We agree that explicit matching or stratification on lesion subtype, image quality, and acquisition factors would strengthen causal attribution to skin tone. The current manuscript focuses on curating and evaluating a new acral black-skin dataset against standard white-skin benchmarks (ISIC-derived) without describing such controls. In the revised version we will expand the Dataset Curation section to report all available metadata on the comparison sets, note any limitations in matching, and add a discussion of potential confounders. Where data permit, we will also include supplementary analyses that stratify or match on subtype distribution. revision: yes

  2. Referee: [Results] Results and abstract: the central claim of 'poor generalizability' is stated without any reported quantitative metrics (accuracy, AUC, etc.), confidence intervals, dataset sizes, or statistical tests comparing the acral black-skin set to the white-skin baseline, so the data-to-claim link cannot be evaluated.

    Authors: The full manuscript contains experimental results, yet we acknowledge that the abstract and Results section do not present the quantitative metrics, confidence intervals, dataset sizes, or statistical comparisons in sufficient detail. In the revision we will update the abstract with key performance numbers and ensure the Results section includes all accuracy/AUC values, confidence intervals, sample sizes, and appropriate statistical tests for the black-skin versus white-skin comparisons. revision: yes

Circularity Check

0 steps flagged

Empirical evaluation with no derivation chain or fitted predictions


This is a dataset curation and model evaluation study. The central claim (poor generalizability to black/acral lesions) rests on direct performance measurements across datasets, not on any equation, parameter fit, or self-citation that reduces the result to its own inputs. No equations, uniqueness theorems, or ansatzes appear in the provided text. The skeptic's concern about confounders is a validity issue, not a circularity issue. A score of 0 is the appropriate finding for an honest empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on the untested premise that the new acral dataset faithfully represents black-skin lesions and that skin-tone category is the dominant driver of performance difference; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: The Fitzpatrick scale provides a sufficient proxy for skin-tone-related appearance variation in lesion images.
    Invoked when the authors assess datasets concerning the Fitzpatrick scale to verify performance on black skin.
  • domain assumption: The selected supervised and self-supervised models are representative of current practice in skin-lesion classification.
    The evaluation treats these models as standard baselines without further justification in the abstract.



Appendix A

In this section, we stratified the results of Table 4 by Fitzpatrick scale. Tables A1, A2, and A3 show the results for the DDI, Fitzpatrick 17k, and PAD-UFES-20* datasets, respectively. Table A1: Evaluation metrics for DDI dataset. #Mel and #Ben ind...