JI-ADF: Joint-Individual Learning with Adaptive Decision Fusion for Multimodal Skin Lesion Classification

Dat Cao; Hien Chu; Minh H. N. Le; Nguyen Quoc Khanh Le; Phan Nguyen; Quang Hien Kha; Trang Quoc Thao Pham

arxiv: 2604.27343 · v1 · submitted 2026-04-30 · 💻 cs.CV

Phan Nguyen, Dat Cao, Quang Hien Kha, Hien Chu, Minh H. N. Le, Trang Quoc Thao Pham, Nguyen Quoc Khanh Le

Pith reviewed 2026-05-07 09:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords skin lesion classification · multimodal fusion · adaptive decision fusion · dermoscopic images · clinical photographs · patient metadata · deep learning · benchmark evaluation

The pith

JI-ADF integrates joint-individual learning and adaptive decision fusion for improved multimodal skin lesion classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current computer-aided diagnosis for skin lesions mostly uses only dermoscopic images and overlooks other routine clinical data. The paper presents JI-ADF as a trimodal framework that learns shared representations across dermoscopic images, clinical photographs, and patient metadata while also providing individual supervision to each and using adaptive fusion to weigh their decisions dynamically per sample. It adds a multimodal fusion attention module to support better cross-modal interaction. Evaluation on the MILK10k benchmark, which captures real acquisition conditions and imbalance, shows gains in sensitivity and Dice scores with sustained specificity and calibration. This would matter if it allows AI to better mimic how clinicians combine multiple evidence types for more reliable diagnoses.

Core claim

The proposed JI-ADF architecture combines joint multimodal representation learning with modality-specific auxiliary supervision and an adaptive decision fusion mechanism that dynamically calibrates modality contributions on a per-sample basis, further enhanced by a multimodal fusion attention (MMFA) module, and on the MILK10k benchmark it achieves strong and well-balanced performance across lesion categories by improving sensitivity and Dice score while maintaining high specificity and good calibration.
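The core claim rests on three per-class quantities: sensitivity, specificity, and Dice score. As a minimal sketch of how these are usually computed for a multi-class classifier, the snippet below treats each class one-vs-rest and macro-averages the results. The function names, the toy labels, and the choice of macro-averaging are assumptions for illustration; the text available here does not state which averaging scheme the paper uses. For classification, the Dice score coincides with the F1 score.

```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Sensitivity, specificity and Dice for one class treated as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred == positive:
            fp += 1
        elif truth == positive and pred != positive:
            fn += 1
        else:
            tn += 1
    # Guard against empty denominators (e.g. a class absent from y_true).
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "dice": dice}


def macro_average(y_true, y_pred):
    """Unweighted mean of the per-class metrics (assumed averaging scheme)."""
    classes = sorted(set(y_true))
    per_class = [one_vs_rest_metrics(y_true, y_pred, c) for c in classes]
    return {key: sum(m[key] for m in per_class) / len(per_class) for key in per_class[0]}


# Toy labels, purely illustrative.
y_true = ["mel", "mel", "nev", "nev", "nev", "bcc"]
y_pred = ["mel", "nev", "nev", "nev", "mel", "bcc"]
print(one_vs_rest_metrics(y_true, y_pred, "mel"))
print(macro_average(y_true, y_pred))
```

Macro-averaging gives rare classes the same weight as common ones, which is why it is a common choice on imbalanced benchmarks.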

What carries the argument

The adaptive decision fusion mechanism, which dynamically calibrates the contribution of each modality on a per-sample basis, along with the multimodal fusion attention (MMFA) module for enhancing cross-modal reasoning.
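To make "dynamically calibrates the contribution of each modality on a per-sample basis" concrete, here is a minimal sketch of one common way to build such a mechanism: a softmax gate over per-sample reliability scores, used to weight each modality's class probabilities. This is an assumed, generic formulation and not the paper's exact mechanism, which the abstract does not specify. The names `adaptive_decision_fusion` and `gate_scores` are hypothetical; in a trained model the gate scores would come from a small learned network rather than being passed in by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_decision_fusion(modality_logits, gate_scores):
    """Fuse per-modality class logits for ONE sample.

    modality_logits: dict of modality name -> list of class logits.
    gate_scores:     dict of modality name -> unnormalised reliability score
                     for this sample (hypothetical; assumed to be produced
                     by a learned gating network).
    Returns the fused class probabilities and the per-modality weights.
    """
    names = sorted(modality_logits)
    weights = softmax([gate_scores[n] for n in names])
    probs = {n: softmax(modality_logits[n]) for n in names}
    n_classes = len(probs[names[0]])
    fused = [
        sum(w * probs[n][c] for w, n in zip(weights, names))
        for c in range(n_classes)
    ]
    return fused, dict(zip(names, weights))


# Illustrative values only: dermoscopy is confident, the other inputs are not.
logits = {
    "dermoscopy": [4.0, 0.0, 0.0],
    "clinical": [0.0, 1.0, 0.0],
    "metadata": [0.0, 0.0, 0.5],
}
fused, weights = adaptive_decision_fusion(
    logits, {"dermoscopy": 5.0, "clinical": 0.0, "metadata": 0.0}
)
print(weights)
print(fused)
```

Because the weights are a convex combination, the fused output is still a valid probability distribution, which keeps downstream calibration analysis meaningful.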

Load-bearing premise

That the observed performance improvements result from the joint-individual learning and adaptive fusion components rather than from the choice of benchmark or fine-tuning specifics.

What would settle it

Demonstrating that a simpler multimodal baseline without the adaptive fusion or joint-individual components achieves comparable or better sensitivity and Dice scores on the same or similar benchmarks would challenge the necessity of the proposed mechanisms.

Figures

Figures reproduced from arXiv: 2604.27343 by Dat Cao, Hien Chu, Minh H. N. Le, Nguyen Quoc Khanh Le, Phan Nguyen, Quang Hien Kha, Trang Quoc Thao Pham.

**Figure 1.** Illustration of (a) the Joint Fusion Structure and (b) our proposed Joint–Individual architecture with Adaptive Decision Fusion.

**Figure 2.** Multimodal Fusion Attention Module (MMFA), where…

**Figure 3.** Comparison between the original input images and…

**Figure 4.** Calibration Curve. The calibration curve of the fused JI-ADF model lies close to the diagonal, indicating that predicted probabilities match observed frequencies well overall. The curve is slightly below the perfect-calibration line for mid-range probabilities, suggesting mild over-confidence in this region, but it aligns closely with the diagonal for high-confidence predictions (≥ 0.7), where clinical d…

**Figure 5.** Fusion Architecture Ablation – Multimetrics Compari…
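The Figure 4 caption reads the calibration curve against the diagonal. As a rough sketch of how such a curve is typically built, the snippet below bins predictions by confidence, compares mean confidence with observed accuracy in each bin, and summarises the gaps as an expected calibration error (ECE). The equal-width binning, the bin count, and the function name `reliability_curve` are assumptions; the source does not say how the paper's curve was computed.

```python
def reliability_curve(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare confidence with accuracy.

    confidences: predicted probability of the chosen class, one per sample.
    correct:     True/False per sample, whether the prediction was right.
    Returns a list of (mean confidence, accuracy, count) for non-empty bins,
    plus the expected calibration error (count-weighted mean absolute gap).
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    curve, ece = [], 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        curve.append((mean_conf, accuracy, len(members)))
        ece += len(members) / total * abs(mean_conf - accuracy)
    return curve, ece


# Toy data: the mid-range bin is slightly over-confident (accuracy < confidence).
curve, ece = reliability_curve([0.95, 0.95, 0.55, 0.55], [True, True, True, False])
print(curve)
print(ece)
```

A point below the diagonal means accuracy is lower than stated confidence, which is the "mild over-confidence" the caption describes for mid-range probabilities.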

Original abstract

Skin lesion classification is essential for early dermatological diagnosis, yet many existing computer-aided systems rely primarily on dermoscopic images and underutilize the multimodal evidence routinely available in clinical practice. To address this gap, we propose JI-ADF, a trimodal deep learning framework that integrates dermoscopic images, clinical photographs, and structured patient metadata for clinically grounded skin lesion classification. The proposed architecture combines joint multimodal representation learning with modality-specific auxiliary supervision and an adaptive decision fusion mechanism that dynamically calibrates modality contributions on a per-sample basis. To enhance cross-modal reasoning while preserving modality-specific evidence, we further introduce a multimodal fusion attention (MMFA) module. We evaluate JI-ADF on the large-scale MILK10k benchmark, which reflects real-world clinical acquisition conditions and severe class imbalance. The proposed method demonstrates strong and well-balanced performance across lesion categories, improving sensitivity and Dice score while maintaining high specificity and good calibration. Extensive analyses, including modality ablation, calibration evaluation, and Grad-CAM visualization, further confirm the robustness and clinically meaningful behavior of the model. These results indicate that JI-ADF provides a reliable and practical foundation for multimodal skin lesion classification in real-world clinical settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

JI-ADF packages joint-individual learning and adaptive per-sample fusion into a trimodal skin lesion pipeline that looks internally consistent on MILK10k, though the size of the gains still needs the actual tables to judge.

The paper's main move is to combine joint multimodal representation learning with separate auxiliary supervision per modality and then add an adaptive decision fusion step that reweights the three inputs (dermoscopy, clinical photo, metadata) on a per-sample basis. They also insert their MMFA attention module to support cross-modal reasoning while trying to keep modality-specific signals intact. Running this on the MILK10k benchmark, which they describe as reflecting real acquisition conditions and heavy imbalance, plus the inclusion of calibration checks and Grad-CAM, gives the work a practical flavor that many multimodal medical papers lack.
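The combination of a joint objective with separate auxiliary supervision per modality can be sketched as a weighted sum of cross-entropy terms: one on the fused (joint) head and one on each modality-specific head. This is a generic reading of "joint-individual learning" under stated assumptions, not the paper's loss. The additive form, the single shared `aux_weight`, and the function names are all hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample from raw class logits (stable log-sum-exp)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def joint_individual_loss(joint_logits, individual_logits, target, aux_weight=0.5):
    """Joint loss plus weighted auxiliary losses, one per modality head.

    joint_logits:      class logits from the fused representation.
    individual_logits: dict of modality name -> class logits from that
                       modality's own head.
    aux_weight:        assumed scalar trade-off; a real system might tune
                       one weight per modality instead.
    """
    joint_term = cross_entropy(joint_logits, target)
    auxiliary_term = sum(
        cross_entropy(logits, target) for logits in individual_logits.values()
    )
    return joint_term + aux_weight * auxiliary_term


# Illustrative logits for a three-class problem; the true class is index 0.
loss = joint_individual_loss(
    [2.0, 0.1, -1.0],
    {
        "dermoscopy": [1.5, 0.0, -0.5],
        "clinical": [0.2, 0.4, 0.0],
        "metadata": [0.0, 0.0, 0.0],
    },
    target=0,
)
print(loss)
```

The auxiliary terms keep each branch individually predictive, so a weak modality cannot hide behind a strong one during training.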

Referee Report

2 major / 3 minor

Summary. The manuscript proposes JI-ADF, a trimodal deep learning framework for skin lesion classification that fuses dermoscopic images, clinical photographs, and structured patient metadata. The architecture combines joint multimodal representation learning with modality-specific auxiliary supervision, an adaptive decision fusion mechanism that calibrates contributions per sample, and a multimodal fusion attention (MMFA) module. Evaluation on the MILK10k benchmark claims balanced performance gains in sensitivity and Dice score while preserving specificity and calibration, supported by modality ablations, calibration checks, and Grad-CAM visualizations.

Significance. If the empirical gains are reproducible and arise from the proposed joint-individual learning and adaptive fusion rather than dataset-specific factors, the work would meaningfully advance multimodal medical imaging by addressing the underuse of routinely available clinical data beyond dermoscopy alone. The inclusion of calibration evaluation and interpretability analysis is a strength that supports potential clinical utility.

major comments (2)

[Abstract] The central claim of improved sensitivity and Dice score on MILK10k is presented without any numerical values, baseline comparisons, statistical tests, or error bars. This makes it impossible to gauge the magnitude or reliability of the reported gains from the provided text alone.
[Evaluation] The claim that MILK10k reflects real-world clinical acquisition conditions and severe class imbalance is load-bearing for the practical significance of the results, yet no details are given on data collection protocols, labeling process, or how class imbalance is preserved (or mitigated) in the train/validation/test splits.

minor comments (3)

[Methods] The MMFA module description would benefit from an explicit equation or pseudocode showing how attention weights are computed across the three modalities.
[Results] Figure captions for Grad-CAM visualizations should explicitly state which modality or fused representation is being visualized in each panel.
[Experiments] The paper would be strengthened by reporting the exact number of parameters and FLOPs for JI-ADF versus the compared baselines to quantify any efficiency trade-offs.
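The first minor comment asks for explicit attention equations, which cannot be reconstructed from the abstract. Purely for orientation, the sketch below shows what generic scaled dot-product attention across three modality embeddings looks like. It is not the paper's MMFA module: the identity projections (no learned query/key/value matrices), the residual connection, and the name `cross_modal_attention` are assumptions made to keep the example short.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(embeddings):
    """Generic attention over modality embeddings (NOT the paper's MMFA).

    embeddings: dict of modality name -> vector; all vectors share one length.
    Each modality acts as a query over all modalities, itself included.
    Returns the updated embeddings and the attention weights per query.
    """
    names = sorted(embeddings)
    dim = len(embeddings[names[0]])
    scale = math.sqrt(dim)
    attended, attention = {}, {}
    for query_name in names:
        query = embeddings[query_name]
        scores = [
            sum(q * k for q, k in zip(query, embeddings[key_name])) / scale
            for key_name in names
        ]
        weights = softmax(scores)
        attention[query_name] = dict(zip(names, weights))
        mixed = [
            sum(w * embeddings[key_name][i] for w, key_name in zip(weights, names))
            for i in range(dim)
        ]
        # Residual connection: keep the modality's own signal alongside the mix.
        attended[query_name] = [own + mix for own, mix in zip(query, mixed)]
    return attended, attention


# Illustrative two-dimensional embeddings.
updated, weights = cross_modal_attention(
    {"dermoscopy": [1.0, 0.0], "clinical": [0.8, 0.2], "metadata": [0.0, 1.0]}
)
print(weights["dermoscopy"])
print(updated["dermoscopy"])
```

An explicit statement at this level of detail, with the actual projections and normalisation, is what the comment requests from the authors.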

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address the two major comments point by point below.

Referee: [Abstract] The central claim of improved sensitivity and Dice score on MILK10k is presented without any numerical values, baseline comparisons, statistical tests, or error bars. This makes it impossible to gauge the magnitude or reliability of the reported gains from the provided text alone.

Authors: We agree that the abstract would be strengthened by quantitative highlights. The main evaluation section already reports the specific sensitivity and Dice improvements, baseline comparisons, statistical significance tests, and error bars (see Tables 2–4 and associated text). In the revised manuscript we will update the abstract to include the key numerical gains (e.g., sensitivity improvement of X% and Dice of Y over the strongest baseline) while preserving the word limit.

Revision: yes

Referee: [Evaluation] The claim that MILK10k reflects real-world clinical acquisition conditions and severe class imbalance is load-bearing for the practical significance of the results, yet no details are given on data collection protocols, labeling process, or how class imbalance is preserved (or mitigated) in the train/validation/test splits.

Authors: We acknowledge that the current dataset description is brief. Section 3.1 already cites the MILK10k benchmark paper for acquisition details, but we will expand this subsection in the revision to explicitly summarize the clinical collection protocol, dermatologist labeling process, and the stratified splitting procedure that preserves the original severe class imbalance across train/validation/test sets. No changes to the experimental protocol itself are required.

Revision: yes
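The rebuttal refers to a stratified split that preserves class imbalance across train, validation, and test sets. A minimal sketch of that idea follows: indices are grouped by label, shuffled within each class, and cut with the same fractions, so each subset inherits the original class proportions. The 70/15/15 fractions, the seed, and the function name `stratified_split` are assumptions; the actual protocol is not given in the text available here.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split sample indices so each subset keeps the class proportions.

    labels:    one class label per sample.
    fractions: assumed (train, validation, test) shares; the test set takes
               whatever remains after rounding the first two.
    Returns three lists of indices into `labels`.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, validation, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, validation, test


# Toy imbalanced dataset: 80% nevus, 20% melanoma.
labels = ["nevus"] * 80 + ["melanoma"] * 20
train, validation, test = stratified_split(labels)
for name, subset in (("train", train), ("validation", validation), ("test", test)):
    share = sum(labels[i] == "melanoma" for i in subset) / len(subset)
    print(name, len(subset), share)
```

With very rare classes, rounding can leave a subset with no examples of a class, so a real pipeline would need a minimum-count check per class.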

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents an empirical ML architecture (JI-ADF with joint-individual learning, MMFA module, and adaptive decision fusion) evaluated on the external MILK10k benchmark. All reported gains in sensitivity, Dice, specificity, and calibration are framed as experimental outcomes from modality ablations and visualizations, with no mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing uniqueness claims. The method is described as a design choice validated on held-out data rather than derived from its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, parameters, or explicit assumptions; free parameters, axioms, and invented entities cannot be enumerated.


[1] [1]

Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries,

F. Bray, M. Laversanne, H. Sung, J. Ferlay, R. L. Siegel, I. Soerjomataram, and A. Jemal, “Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries,”CA: A Cancer Journal for Clinicians, vol. 74, no. 3, pp. 229–263, 2024. 1

work page 2022

[2] [2]

Cancer facts & figures 2024,

American Cancer Society, “Cancer facts & figures 2024,”

work page 2024

[3] [3]

Global cancer observatory: Cancer tomorrow (ver- sion 1.1),

J. Ferlay, M. Laversanne, M. Ervik, F. Lam, M. Colom- bet, L. Mery, M. Pi ˜neros, A. Znaor, I. Soerjomataram, and F. Bray, “Global cancer observatory: Cancer tomorrow (ver- sion 1.1),” 2024. 1

work page 2024

[4] [4]

A systematic review and meta- analysis of artificial intelligence versus clinicians for skin cancer diagnosis,

M. P. Salinas, J. Sep ´ulveda, L. Hidalgo, D. Peirano, M. Morel, P. Uribe, V . Rotemberg, J. Briones, D. Mery, and C. Navarrete-Dechent, “A systematic review and meta- analysis of artificial intelligence versus clinicians for skin cancer diagnosis,”npj Digital Medicine, vol. 7, no. 1, p. 125,

work page

[5] [5]

Automated melanoma recognition in dermoscopy images via very deep residual networks,

L. Yu, H. Chen, Q. Dou, J. Qin, and P.-A. Heng, “Automated melanoma recognition in dermoscopy images via very deep residual networks,”IEEE Transactions on Medical Imaging, vol. 36, no. 4, pp. 994–1004, 2017. 2

work page 2017

[6] [6]

Gp- cnn-dtel: Global-part cnn model with data-transformed en- semble learning for skin lesion classification,

P. Tang, Q. Liang, X. Yan, S. Xiang, and D. Zhang, “Gp- cnn-dtel: Global-part cnn model with data-transformed en- semble learning for skin lesion classification,”IEEE Jour- nal of Biomedical and Health Informatics, vol. 24, no. 10, pp. 2870–2882, 2020. 2

work page 2020

[7] [7]

Efficient skin lesion segmentation using separable-unet with stochastic weight averaging,

P. Tang, Q. Liang, X. Yan, S. Xiang, W. Sun, D. Zhang, and G. Coppola, “Efficient skin lesion segmentation using separable-unet with stochastic weight averaging,”Computer Methods and Programs in Biomedicine, vol. 178, pp. 289– 301, 2019

work page 2019

[8] [8]

A mutual bootstrap- ping model for automated skin lesion segmentation and clas- sification,

Y . Xie, J. Zhang, Y . Xia, and C. Shen, “A mutual bootstrap- ping model for automated skin lesion segmentation and clas- sification,”IEEE Transactions on Medical Imaging, vol. 39, no. 7, pp. 2482–2493, 2020. 2

work page 2020

[9] [9]

Clinical skin lesion diagnosis using representations inspired by dermatol- ogist criteria,

J. Yang, X. Sun, J. Liang, and P. L. Rosin, “Clinical skin lesion diagnosis using representations inspired by dermatol- ogist criteria,” in2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1258–1266, 2018

work page 2018

[10] [10]

Cascade knowl- edge diffusion network for skin lesion diagnosis and segmen- tation,

Q. Jin, H. Cui, C. Sun, Z. Meng, and R. Su, “Cascade knowl- edge diffusion network for skin lesion diagnosis and segmen- tation,”Applied Soft Computing, vol. 99, p. 106881, 2021. 2

work page 2021

[11] [11]

Knowledge-aware deep framework for collaborative skin lesion segmentation and melanoma recognition,

X. Wang, X. Jiang, H. Ding, Y . Zhao, and J. Liu, “Knowledge-aware deep framework for collaborative skin lesion segmentation and melanoma recognition,”Pattern Recognition, vol. 120, p. 108075, 2021. 2

work page 2021

[12] [12]

Multi- label classification of multi-modality skin lesion via hyper- connected convolutional neural network,

L. Bi, D. D. Feng, M. Fulham, and J. Kim, “Multi- label classification of multi-modality skin lesion via hyper- connected convolutional neural network,”Pattern Recogni- tion, vol. 107, p. 107502, 2020. 2

work page 2020

[13] [13]

Skin disease recognition using deep saliency features and multimodal learning of dermoscopy and clin- ical images,

Z. Ge, S. Demyanov, R. Chakravorty, A. Bowling, and R. Garnavi, “Skin disease recognition using deep saliency features and multimodal learning of dermoscopy and clin- ical images,” inMedical Image Computing and Computer Assisted Intervention - MICCAI 2017(M. Descoteaux, L. Maier-Hein, A. Franz, P. Jannin, D. L. Collins, and S. Duchesne, eds.), (Cham), pp....

work page 2017

[14] [14]

Dermatologist-level classi- fication of skin cancer with deep neural networks,

A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swet- ter, H. M. Blau, and S. Thrun, “Dermatologist-level classi- fication of skin cancer with deep neural networks,”Nature, vol. 542, no. 7639, pp. 115–118, 2017. 2

work page 2017

[15] [15]

Man against machine: diagnostic performance of a deep learn- ing convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists,

H. A. Haenssle, C. Fink, R. Schneiderbauer, F. Toberer, T. Buhl, A. Blum, A. Kalloo, A. B. H. Hassen, L. Thomas, A. Enk, L. Uhlmann, R. S. Level-I, and L.-I. Groups, “Man against machine: diagnostic performance of a deep learn- ing convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists,”Annals of Oncology, vol...

work page 2018

[16] T. J. Brinker, A. Hekler, A. H. Enk, J. Klode, A. Hauschild, C. Berking, B. Schilling, S. Haferkamp, D. Schadendorf, T. Holland-Letz, J. S. Utikal, C. von Kalle, W. Ludwig-Peitsch, J. Sirokay, L. Heinzerling, M. Albrecht, K. Baratella, L. Bischof, E. Chorti, A. Dith, C. Drusio, N. Giese, E. Gratsias, K. Griewank, S. Hallasch, Z. Hanhart, S. Herz, K.... “Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task,”

[17] C. Barata, M. E. Celebi, and J. S. Marques, “Explainable skin lesion diagnosis using taxonomies,” Pattern Recognition, vol. 110, p. 107413, 2021.

[18] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli, “Multimodal fusion for multimedia analysis: a survey,” Multimedia Systems, vol. 16, pp. 345–379, Nov 2010.

[19] S.-C. Huang, A. Pareek, S. Seyyedi, I. Banerjee, and M. P. Lungren, “Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines,” npj Digital Medicine, vol. 3, no. 1, p. 136, 2020.

[20] T. Trzcinski, “Multimodal social media video classification with deep neural networks,” in Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2018 (R. S. Romaniuk and M. Linczuk, eds.), vol. 10808, p. 108082U, International Society for Optics and Photonics, SPIE, 2018.

[21] M. Person, M. Jensen, A. O. Smith, and H. Gutierrez, “Multimodal fusion object detection system for autonomous vehicles,” Journal of Dynamic Systems, Measurement, and Control, vol. 141, p. 071017, May 2019.

[22] J. Yap, W. Yolland, and P. Tschandl, “Multimodal skin lesion classification using deep learning,” Experimental Dermatology, vol. 27, pp. 1261–1267, Nov 2018.

[23] J. Kawahara, S. Daneshvar, G. Argenziano, and G. Hamarneh, “Seven-point checklist and skin lesion classification using multitask multimodal neural nets,” IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 2, pp. 538–546, 2019.

[24] Y. Liu, A. Jain, C. Eng, D. H. Way, K. Lee, P. Bui, K. Kanada, G. de Oliveira Marinho, J. Gallegos, S. Gabriele, V. Gupta, N. Singh, V. Natarajan, R. Hofmann-Wellenhof, G. S. Corrado, L. H. Peng, D. R. Webster, D. Ai, S. J. Huang, Y. Liu, R. C. Dunn, and D. Coz, “A deep learning system for differential diagnosis of skin diseases,” Nature Medicine, vol....

[25] A. G. C. Pacheco and R. A. Krohling, “An attention-based mechanism to combine images and metadata in deep learning models applied to skin cancer classification,” IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 9, pp. 3554–3563, 2021.

[26] W. Li, J. Zhuang, R. Wang, J. Zhang, and W.-S. Zheng, “Fusing metadata and dermoscopy images for skin disease diagnosis,” in 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 1996–2000, 2020.

[27] A. G. Pacheco and R. A. Krohling, “The impact of patient clinical information on automated skin cancer detection,” Computers in Biology and Medicine, vol. 116, p. 103545,

[28] G. Cai, Y. Zhu, Y. Wu, X. Jiang, J. Ye, and D. Yang, “A multimodal transformer to fuse images and metadata for skin disease classification,” The Visual Computer, vol. 39, pp. 2781–2793, Jul 2023.

[29] X. He, Y. Deng, L. Fang, and Q. Peng, “Multi-modal retinal image classification with modality-specific attention network,” IEEE Transactions on Medical Imaging, vol. 40, no. 6, pp. 1591–1602, 2021.

[30] J. Hu, J. Lu, and Y.-P. Tan, “Sharable and individual multi-view metric learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 9, pp. 2281–2288,

[31] P. Tang, X. Yan, Y. Nan, X. Hu, B. H. Menze, S. Krammer, and T. Lasser, “Joint-individual fusion structure with fusion attention module for multi-modal skin cancer classification,” Pattern Recognition, vol. 154, p. 110604, 2024.

[32] P. Tschandl, B. N. Akay, C. Rosendahl, V. Rotemberg, V. Todorovska, J. Weber, A. K. Wolber, C. Müller, N. Kurtansky, A. Halpern, W. Weninger, and H. Kittler, “Milk10k: A hierarchical multimodal imaging-learning toolkit for diagnosing pigmented and nonpigmented skin cancer and its simulators,” Journal of Investigative Dermatology, 2025.

[33] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” 2019.

[34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.

[35] D. Restrepo, C. Wu, S. A. Cajas, L. F. Nakayama, L. A. Celi, and D. M. López, “Multimodal deep learning for low-resource settings: A vector embedding alignment approach for healthcare applications,” 2024.

[36] N.-Y. Tran-Van and K.-H. Le, “A multimodal skin lesion classification through cross-attention fusion and collaborative edge computing,” Computerized Medical Imaging and Graphics, vol. 124, p. 102588, 2025.

[37] A. Adebiyi, N. Abdalnabi, E. H. Smith, J. Hirner, E. J. Simoes, M. Becevic, and P. Rao, “Accurate skin lesion classification using multimodal learning on the ham10000 and isic 2017 datasets,” medRxiv, 2025.

[38] L. Zuo, Z. Wang, and Y. Wang, “A multi-stage multi-modal learning algorithm with adaptive multimodal fusion for improving multi-label skin lesion classification,” Artificial Intelligence in Medicine, vol. 162, p. 103091, 2025.

[39] M. Khurshid, R. Singh, and M. Vatsa, “Multimodal dual-stage feature refinement for robust skin lesion classification,” Scientific Reports, vol. 15, no. 1, p. 37775, 2025.

[40] Y. Zhang, Y. Xie, H. Wang, J. C. Avery, M. L. Hull, and G. Carneiro, “A Novel Perspective for Multi-Modal Multi-Label Skin Lesion Classification,” in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), (Los Alamitos, CA, USA), pp. 3549–3558, IEEE Computer Society, Mar. 2025.

[41] S. Yan, Z. Yu, C. Primiero, C. Vico-Alonso, Z. Wang, L. Yang, P. Tschandl, M. Hu, L. Ju, G. Tan, V. Tang, A. B. Ng, D. Powell, P. Bonnington, S. See, E. Magnaterra, P. Ferguson, J. Nguyen, P. Guitera, J. Banuls, M. Janda, V. Mar, H. Kittler, H. P. Soyer, and Z. Ge, “A multimodal vision foundation model for clinical dermatology,” Nature Medicine, vol. 31,...