
arxiv: 2606.13135 · v1 · pith:V7DMLLPUnew · submitted 2026-06-11 · 💻 cs.CV · cs.AI

Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

Pith reviewed 2026-06-27 07:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords dermoscopic images · skin neoplasms · cascade classification · deep learning · generalization gap · sensitivity control · external validation · binary triage

The pith

A cascade of binary triage followed by three-class differentiation provides a tunable sensitivity threshold for skin neoplasm images that standard single-stage argmax classification does not offer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares four deep learning architectures on dermoscopic images of skin neoplasms using binary, single-stage four-class, and two-stage cascade schemes. It demonstrates that the cascade recovers malignant lesions often misassigned to the dominant benign class in single-stage setups and supplies an adjustable threshold for controlling sensitivity. Models trained on aggregated open ISIC data perform well internally but exhibit clear drops in ranking, sensitivity, and calibration on two independent Russian clinical datasets. The work concludes that the cascade better matches clinical differential-diagnosis logic yet requires external validation and recalibration prior to deployment.
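The summary above calls for recalibration before deployment but does not say which method would be used. As one hedged illustration, the sketch below applies Platt scaling, a common post-hoc technique chosen here as an assumption and not as the paper's method. It fits a slope `a` and offset `b` on a small labeled sample from the target clinic so that `sigmoid(a * z + b)` replaces the raw malignancy probability. The function names, learning rate, and step count are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(logits, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * z + b) by gradient descent on the mean log loss.

    logits: raw binary-stage scores on a calibration sample (assumed available)
    labels: 1 for malignant, 0 for benign
    """
    a, b = 1.0, 0.0  # start from the uncalibrated model
    n = len(logits)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for z, y in zip(logits, labels):
            err = sigmoid(a * z + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * z
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def recalibrate(logit, a, b):
    """Map a raw logit to a recalibrated malignancy probability."""
    return sigmoid(a * logit + b)
```

A positive fitted offset `b` corresponds to correcting a model that underestimates malignancy, which is the direction of miscalibration the paper reports on external data. The calibration sample must come from the deployment population and overlap between classes, otherwise the fit does not converge to finite values.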

Core claim

The paper evaluates ViT-B/16, Swin-S, ConvNeXt-S, and EfficientNetV2-S across binary, single-stage four-class, and cascade schemes on aggregated ISIC data. It shows that the cascade raises macro F1 over single-stage four-class classification for most architectures, though the gain is statistically significant only for ViT-B/16. The binary triage stage attains ROC-AUC 0.952-0.966 internally but drops to 0.797-0.893 on Sechenov University data, where sensitivity falls to 0.53-0.67 and ECE rises from 0.02 to 0.27-0.39. No architecture proves superior at the differentiation stage on clinical data, and direct 11-class classification on ISIC MILK10k yields mean-class sensitivity of 0.525.
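For readers unfamiliar with ECE (expected calibration error), the following is a minimal sketch of one common binary formulation: predictions are grouped into equal-width probability bins, and the gap between the observed malignant fraction and the mean predicted probability is averaged with bin-size weights. The bin count of 10 and the function name are assumptions; the paper's exact binning is not stated in this summary.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE over equal-width bins of the predicted malignancy probability.

    probs:  predicted probability of the malignant class, each in [0, 1]
    labels: 1 for malignant, 0 for benign
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(frac_pos - mean_p)
    return ece
```

Under this definition, a model that assigns probability 0.2 to a group of lesions of which 75% are malignant contributes a gap of 0.55 for that bin, which is the kind of malignancy underestimation the external results describe.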

What carries the argument

Two-stage cascade: binary malignant/benign triage with adjustable threshold, followed by three-class differentiation among malignant types (MEL, SCC, BCC).
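The cascade described above can be sketched as a small decision function. This is an illustrative reading of the scheme, not the authors' code: `cascade_predict`, its arguments, and the default threshold of 0.5 are hypothetical, and the two stage outputs are assumed to come from separately trained models.

```python
MALIGNANT_CLASSES = ("MEL", "SCC", "BCC")

def cascade_predict(p_malignant, stage2_probs, threshold=0.5):
    """Two-stage cascade decision for one image.

    p_malignant:  stage-1 probability that the lesion is malignant
    stage2_probs: stage-2 probabilities over MEL, SCC, BCC (same order)
    threshold:    triage operating point; lowering it raises sensitivity
    """
    # Stage 1: binary triage with an adjustable threshold.
    if p_malignant < threshold:
        return "benign"
    # Stage 2: differentiate only among malignant types.
    best = max(range(len(MALIGNANT_CLASSES)), key=lambda i: stage2_probs[i])
    return MALIGNANT_CLASSES[best]

# The same lesion is triaged differently at two operating points.
print(cascade_predict(0.3, [0.2, 0.1, 0.7], threshold=0.5))  # benign
print(cascade_predict(0.3, [0.2, 0.1, 0.7], threshold=0.2))  # BCC
```

Because the benign class never competes with the malignant classes inside stage 2, a lesion that passes triage is always assigned a malignant type.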

If this is right

  • Cascade raises macro F1 over single-stage four-class classification for most architectures (significantly only for ViT-B/16) by recovering malignant lesions assigned to the benign class.
  • Tunable triage threshold supplies sensitivity control unattainable with standard single-stage argmax classification.
  • Binary stage ROC-AUC falls from 0.952-0.966 internally to 0.797-0.893 on external clinical data, with sensitivity declining to 0.53-0.67.
  • Calibration error rises sharply on external data, with malignancy underestimation quantified by ECE increasing to 0.27-0.39.
  • No architecture shows a proven advantage at the malignant differentiation stage on clinical data.
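The sensitivity control mentioned in the list above amounts to choosing the triage threshold on labeled validation data. A minimal sketch under that assumption follows; the helper names are hypothetical, and the validation set is assumed to contain at least one malignant and one benign case.

```python
def sensitivity_specificity(probs, labels, threshold):
    """Sensitivity and specificity when predicting malignant for p >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def threshold_for_sensitivity(probs, labels, target):
    """Largest observed score usable as a threshold with sensitivity >= target.

    Sensitivity never decreases as the threshold is lowered, so scanning the
    candidate thresholds from high to low and stopping at the first hit
    gives the most specific operating point that meets the target.
    """
    for t in sorted(set(probs), reverse=True):
        sens, _ = sensitivity_specificity(probs, labels, t)
        if sens >= target:
            return t
    return 0.0  # fallback: call everything malignant
```

A threshold chosen this way on internal data need not deliver the same sensitivity on an external clinic, which is consistent with the drop to 0.53-0.67 reported above; the selection would have to be repeated on data from the deployment population.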

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The persistent gap between internal and external performance implies that domain adaptation or target-population data collection may be required before reliable clinical use.
  • The cascade structure could be tested on other imbalanced medical imaging tasks where rare positive cases must be separated from a large negative background.
  • Incorporating additional patient metadata or multi-modal inputs might narrow the observed generalization gap between open international and local clinical datasets.
  • Regulatory pathways for similar diagnostic tools would likely need to require independent external validation on representative populations.

Load-bearing premise

Aggregated open ISIC Archive data with ImageNet-pretrained weights provides a sufficient basis for models that transfer meaningfully to independent Russian clinical datasets without domain adaptation.

What would settle it

Showing that adjusting the triage threshold on the Sechenov University or Melanoscope AI datasets produces no improvement in macro F1 or sensitivity control compared with single-stage argmax classification would falsify the claimed advantage of the cascade.
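Such a check would compare macro F1 between the two schemes on the same external labels. As a reminder of what the metric measures, here is a small sketch of macro F1 (the unweighted mean of per-class F1); the function name is hypothetical, and scoring a class with no true or predicted samples as 0 is one convention among several.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class counts equally, a single melanoma misassigned to the dominant benign class lowers macro F1 far more than it lowers plain accuracy, which is why the metric is sensitive to the recovery effect the paper describes.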

Figures

Figures reproduced from arXiv: 2606.13135 by Aleksandr V. Kozachok, Elena S. Kozachok, Ilya P. Latyshev, Oleg I. Samovarov, Sergey S. Seregin.

Figure 1: Two-stage cascade classification scheme. Stage 1 triages by ...
Figure 2: Learning-rate schedule: linear warm-up followed by cosine annealing (AdamW, ...
Figure 3: Learning curves (stage 2, MEL / SCC / BCC): (a) training (solid) and ...
Figure 4: Binary-stage ROC curves (malignant / benign) for four architectures on three ...
Figure 5: Binary-stage reliability diagrams: observed malignant fraction vs predicted ...
Figure 6: Confusion matrix of ViT-B/16 in three-class differentiation ...
Figure 7: Comparison of macro F1 of single-stage four-class and cascade ...
Figure 8: End-to-end cascade confusion matrix, Sechenov University dataset ( ...
Original abstract

Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript compares four deep learning architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) for dermoscopic skin neoplasm classification under three schemes: binary (malignant/benign), single-stage four-class (benign/MEL/SCC/BCC), and a two-stage cascade (binary triage then three-class differentiation). All models use ImageNet-pretrained weights and are trained on aggregated ISIC Archive data; evaluation occurs on an internal held-out sample plus two external Russian clinical datasets. Reported results include internal binary AUC of 0.952-0.966 dropping to 0.797-0.893 externally with sensitivity 0.53-0.67 and rising ECE, macro F1 gains for cascade over single-stage (significant only for ViT-B/16), and statistical tests confirming limited inter-architecture differences on clinical data. The conclusion states that a tunable triage threshold enables sensitivity control unattainable with standard single-stage argmax classification and better matches clinical logic, while the generalization gap requires external validation and recalibration.

Significance. If the central claims hold, the work supplies concrete empirical support for cascade schemes in medical image triage by quantifying sensitivity control and domain-shift effects via external validation on independent clinical data. Credit is due for reporting specific AUC/sensitivity/F1/ECE values, paired statistical tests, and the explicit quantification of the generalization gap (AUC drop and ECE rise from 0.02 to 0.27-0.39). These elements provide a falsifiable basis for the triage-threshold advantage and the call for recalibration.

major comments (1)
  1. [Results] Results: the central claim that 'a tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification' rests solely on comparison to argmax single-stage four-class models. No results are shown for single-stage four-class models whose output probabilities are thresholded (e.g., malignancy probability or per-class operating points) to achieve the same external sensitivity range (0.53-0.67); this comparison is required to substantiate that the reported control is unavailable in any single-stage formulation.
minor comments (2)
  1. [Abstract] Abstract: the specific data subset (internal vs. external) on which the macro F1 improvement reaches statistical significance for ViT-B/16 is not stated.
  2. [Methods] The manuscript does not detail whether the single-stage models were also evaluated under any form of probability thresholding, leaving the scope of the 'standard single-stage' baseline ambiguous.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on our results section. We address the point below and agree that additional comparisons will strengthen the manuscript.

Point-by-point responses
  1. Referee: [Results] Results: the central claim that 'a tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification' rests solely on comparison to argmax single-stage four-class models. No results are shown for single-stage four-class models whose output probabilities are thresholded (e.g., malignancy probability or per-class operating points) to achieve the same external sensitivity range (0.53-0.67); this comparison is required to substantiate that the reported control is unavailable in any single-stage formulation.

    Authors: We agree that the referee's point is valid for fully substantiating the advantage of the cascade. While the manuscript explicitly frames its claim against the standard argmax single-stage four-class output (as stated in the conclusion), a comparison to single-stage models operated with probability thresholding is a natural extension. In the revised manuscript we will add results for single-stage four-class models where decision thresholds are adjusted on the output probabilities (both on the aggregated malignant probability and per-class operating points) to target the same external sensitivity range of 0.53-0.67. We will report the resulting macro F1, specificity, and calibration metrics alongside the cascade results. This will clarify whether the cascade provides sensitivity control that cannot be replicated by post-hoc thresholding in a single-stage formulation. revision: yes
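The comparison discussed in this exchange can be made concrete. The sketch below shows how a single-stage four-class softmax output could be operated with a threshold on the aggregated malignant probability instead of argmax. It is an illustration of the referee's proposal under stated assumptions, not a result from the paper; `single_stage_predict` and the class ordering are hypothetical.

```python
CLASSES = ("benign", "MEL", "SCC", "BCC")

def single_stage_predict(probs, threshold=None):
    """Decision rule for a four-class softmax output ordered as CLASSES.

    threshold=None reproduces the standard argmax rule. Otherwise the
    malignant probabilities are summed and compared to the threshold,
    and the most likely malignant type is returned if it is reached.
    """
    if threshold is None:
        return CLASSES[max(range(len(CLASSES)), key=lambda i: probs[i])]
    p_malignant = sum(probs[1:])  # MEL + SCC + BCC
    if p_malignant < threshold:
        return "benign"
    best = max(range(1, len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best]

# Benign wins under argmax even though malignant mass is larger in total.
probs = [0.45, 0.30, 0.05, 0.20]
print(single_stage_predict(probs))                 # benign
print(single_stage_predict(probs, threshold=0.5))  # MEL
```

In this toy case the malignant classes split their probability mass, so argmax returns benign while the thresholded rule does not. Whether this post-hoc rule matches the cascade on real data is exactly the open question the referee raises.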

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons on held-out and external datasets

Full rationale

The paper reports training and evaluation of four architectures under three classification schemes (binary, single-stage four-class, cascade) using ImageNet-pretrained weights on aggregated ISIC data, with metrics on internal held-out and two external clinical datasets. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear; the central claim about tunable triage thresholds is an empirical observation from direct comparisons to argmax baselines, not a reduction to inputs by construction. The generalization gap is quantified via explicit AUC/ECE drops rather than assumed away.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on standard supervised learning assumptions and the representativeness of the datasets used.

free parameters (1)
  • triage threshold
    Tunable parameter for sensitivity control in cascade scheme
axioms (2)
  • domain assumption: ImageNet-pretrained weights and a single augmentation protocol are sufficient for fair comparison across architectures
    Used for all models in the study
  • domain assumption: The clinical datasets from Melanoscope AI and Sechenov University are independent and representative of Russian practice
    Used for external validation


