Entropy-Guided Self-Supervised Learning for Medical Image Classification

Joao Florindo; Viviane Moura

arxiv: 2605.21970 · v1 · pith:BI4M5YDSnew · submitted 2026-05-21 · 📡 eess.IV · cs.CV

Pith reviewed 2026-05-22 03:13 UTC · model grok-4.3

classification: eess.IV, cs.CV

keywords: self-supervised learning, medical image classification, masked autoencoder, ensemble learning, transfer learning, ConvNeXt, entropy-guided pretraining

The pith

An ensemble averaging predictions from ImageNet-pretrained and entropy-guided MAE-pretrained ConvNeXt models improves medical image classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that pairs two ConvNeXt-Tiny models to handle scarce labels and subtle class differences in medical images. One model draws general features from ImageNet pre-training while the other learns domain-specific representations through entropy-guided masked autoencoding on the target medical data. Both are fine-tuned on the classification tasks, then their probability outputs are averaged to form the final prediction. Experiments on breast ultrasound, skin lesion, gastrointestinal, and COVID datasets show the combined system exceeds the accuracy of either model alone and of prior methods.

Core claim

The paper claims that pre-training one ConvNeXt-Tiny on ImageNet and a second on the medical dataset via entropy-guided masked autoencoding, followed by fine-tuning and simple probability averaging, produces complementary features that yield state-of-the-art classification accuracy and robustness on the BUSI, ISIC 2018, Kvasir, and COVID-19 datasets.

What carries the argument

Ensemble formed by averaging the predicted probabilities of an ImageNet-pretrained ConvNeXt-Tiny and an entropy-guided MAE-pretrained ConvNeXt-Tiny after task-specific fine-tuning.
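
The averaging step described here can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `average_probabilities` and `predict` and the toy probability values are hypothetical, and the inputs are assumed to be per-class softmax outputs from the two fine-tuned models.

```python
from typing import List, Sequence


def average_probabilities(
    probs_a: Sequence[Sequence[float]],
    probs_b: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Average two models' per-class probabilities, sample by sample."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same samples")
    averaged = []
    for row_a, row_b in zip(probs_a, probs_b):
        if len(row_a) != len(row_b):
            raise ValueError("both models must use the same class set")
        averaged.append([(pa + pb) / 2.0 for pa, pb in zip(row_a, row_b)])
    return averaged


def predict(probs: Sequence[Sequence[float]]) -> List[int]:
    """Pick the class index with the highest probability for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Hypothetical softmax outputs for two samples and three classes.
imagenet_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
mae_probs = [[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]

ensemble_probs = average_probabilities(imagenet_probs, mae_probs)
ensemble_pred = predict(ensemble_probs)
print(ensemble_pred)  # [1, 2]
```

In this toy case the ensemble follows the more confident model on each sample, which is the behaviour simple averaging relies on.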

If this is right

Domain-specific MAE pre-training supplies features that complement the general features from ImageNet.
Probability averaging produces higher accuracy than either model used alone.
The method reaches state-of-the-art results on four distinct medical imaging modalities.
Combining broad and narrow pre-training strategies mitigates limited annotated data and high intra-class variability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dual-pretraining pattern could be tried with other backbone families such as Vision Transformers.
Entropy-guided masking might be tested inside other self-supervised objectives beyond MAE.
Weighted fusion or learned combination layers could replace simple averaging in future variants.
The approach may transfer to other data-scarce domains such as satellite or microscopy images.
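
The weighted-fusion variant mentioned above could be explored with a simple grid search over a single mixing weight. The sketch below is an assumption-laden illustration rather than anything the paper proposes: `fuse`, `accuracy`, `best_weight`, and the validation data are all hypothetical, and the weight is assumed to be tuned on held-out validation data, never on the test set.

```python
from typing import List, Sequence

Probs = Sequence[Sequence[float]]


def fuse(probs_a: Probs, probs_b: Probs, weight_a: float) -> List[List[float]]:
    """Convex combination: weight_a * probs_a + (1 - weight_a) * probs_b."""
    return [
        [weight_a * pa + (1.0 - weight_a) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(probs_a, probs_b)
    ]


def accuracy(probs: Probs, labels: Sequence[int]) -> float:
    """Fraction of samples whose arg-max class matches the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in probs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def best_weight(probs_a: Probs, probs_b: Probs, labels: Sequence[int],
                steps: int = 10) -> float:
    """Grid-search weight_a over [0, 1]; ties keep the smallest weight."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates,
               key=lambda w: accuracy(fuse(probs_a, probs_b, w), labels))


# Hypothetical validation outputs for three samples and two classes.
val_a = [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4]]
val_b = [[0.3, 0.7], [0.2, 0.8], [0.1, 0.9]]
val_labels = [0, 1, 1]

w = best_weight(val_a, val_b, val_labels)
```

Setting `weight_a = 0.5` recovers the plain averaging used in the paper, so the grid search can only match or beat it on the validation data it is tuned on.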

Load-bearing premise

The two pre-trained models supply complementary features that are effectively combined simply by averaging their predicted probabilities.

What would settle it

If the ensemble fails to outperform both individual models and existing methods on a fifth independent medical imaging dataset, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.21970 by Joao Florindo, Viviane Moura.

**Figure 2.** Accuracy evolution of the Ensemble model varying the proportion of epochs

Original abstract

Accurate and robust medical image classification is paramount for early disease diagnosis and treatment planning. However, challenges such as limited annotated data, high intra-class variability, and subtle inter-class differences often hinder the performance of deep learning models. This paper introduces a synergistic deep learning framework that leverages the strengths of self-supervised learning and transfer learning for enhanced medical image classification. Our approach employs two distinct ConvNeXt-Tiny models: one pre-trained on a large-scale natural image dataset (ImageNet) and another pre-trained using an entropy-guided Masked Autoencoder (MAE) on the target medical dataset. Both models are then fine-tuned on specific medical image classification tasks. A final ensemble strategy, based on averaging predicted probabilities, is utilized to combine the complementary insights from these two models. Rigorous experimental validation across four diverse medical imaging datasets (Breast Ultrasound Images (BUSI), International Skin Imaging Collaboration (ISIC) 2018, Kvasir, and COVID) demonstrates the superior performance and robustness of our ensemble approach. The MAE pre-training significantly improves feature learning on domain-specific data, while the ImageNet pre-training provides strong generalizable features. The ensemble consistently achieves state-of-the-art results, outperforming individual models and existing methods, highlighting the efficacy of combining diverse pre-training strategies for challenging medical image analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper ensembles an ImageNet-pretrained ConvNeXt-Tiny with a second ConvNeXt-Tiny pre-trained by entropy-guided MAE on the medical data, then averages their probabilities. The abstract, however, supplies no numbers, ablations, or evidence that the two models' errors are complementary enough for the averaging to matter.

The main thing to know is that this work takes two ConvNeXt-Tiny models, pre-trains one on ImageNet and the other with entropy-guided masked autoencoding directly on the target medical images, fine-tunes both, and combines them by averaging predicted probabilities. The abstract claims this beats single models and prior methods on BUSI, ISIC 2018, Kvasir, and COVID data, but it contains no accuracy figures, error bars, or ablation results to support that.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a framework for medical image classification that combines two ConvNeXt-Tiny models: one pre-trained on ImageNet and one pre-trained via entropy-guided Masked Autoencoder (MAE) on the target medical dataset. Both are fine-tuned on the downstream tasks, and their predicted probabilities are averaged to form an ensemble. The authors claim this yields superior performance and robustness, achieving state-of-the-art results on the BUSI, ISIC 2018, Kvasir, and COVID datasets.

Significance. If the reported gains are statistically significant and arise from genuinely complementary features rather than correlated errors, the approach offers a practical recipe for blending general-domain and domain-specific pre-training in label-scarce medical imaging settings. The entropy-guided MAE component is a concrete technical choice that could be adopted more broadly.

major comments (2)

[Abstract and Results section] The central claim that the ensemble outperforms the individual models rests on the unverified assumption that the ImageNet-pretrained and entropy-guided MAE-pretrained ConvNeXt-Tiny models produce sufficiently uncorrelated errors. The manuscript supplies no prediction-correlation analysis, error-overlap statistics, or ablation comparing the ensemble to the stronger single model (see the ensemble description and results tables).
[Experimental Results] Performance tables lack error bars, statistical significance tests (e.g., paired t-tests or McNemar tests on the reported accuracy/F1 gains), and explicit train/validation/test split details. Without these, the assertions of 'superior performance and robustness' and 'state-of-the-art results' cannot be rigorously evaluated.
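
The error-overlap statistics requested in the first major comment can be computed from paired test-set predictions. The following is a minimal sketch under stated assumptions: the function name `error_overlap`, the dictionary keys, and the toy predictions are hypothetical, and the phi coefficient is used here as one reasonable measure of error correlation, not as the analysis the authors will necessarily adopt.

```python
from math import sqrt
from typing import Dict, Sequence


def error_overlap(
    preds_a: Sequence[int], preds_b: Sequence[int], labels: Sequence[int]
) -> Dict[str, float]:
    """Summarise how often two classifiers fail on the same samples."""
    err_a = [int(p != y) for p, y in zip(preds_a, labels)]
    err_b = [int(p != y) for p, y in zip(preds_b, labels)]
    n = len(labels)
    both = sum(1 for a, b in zip(err_a, err_b) if a and b)
    only_a = sum(1 for a, b in zip(err_a, err_b) if a and not b)
    only_b = sum(1 for a, b in zip(err_a, err_b) if b and not a)
    neither = n - both - only_a - only_b
    # Phi coefficient: Pearson correlation of the two binary error indicators.
    denom = sqrt(
        (both + only_a) * (only_b + neither) * (both + only_b) * (only_a + neither)
    )
    phi = (both * neither - only_a * only_b) / denom if denom else float("nan")
    return {
        "both_wrong": both,
        "only_a_wrong": only_a,
        "only_b_wrong": only_b,
        "both_right": neither,
        "disagreement_rate": (only_a + only_b) / n,
        "phi": phi,
    }


# Hypothetical predictions from the two fine-tuned models on eight samples.
labels = [0, 1, 1, 0, 1, 0, 1, 0]
preds_imagenet = [0, 1, 0, 0, 1, 1, 1, 0]
preds_mae = [0, 0, 0, 0, 1, 0, 1, 1]

stats = error_overlap(preds_imagenet, preds_mae, labels)
```

A phi value near zero together with a non-trivial disagreement rate would support the complementarity premise; a value near one would suggest the two models mostly fail on the same images.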

minor comments (2)

[Method] Clarify the precise entropy computation and masking schedule used in the MAE pre-training; a short pseudocode or equation would remove ambiguity.
[Experimental Setup] Add references for the four datasets (BUSI, ISIC 2018, Kvasir, COVID) in the experimental setup if they are currently only named.
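
To make the first minor comment concrete, the sketch below shows one plausible form entropy-guided masking could take. It is not the paper's method, which this page does not specify. The assumptions are: entropy means the Shannon entropy of each patch's intensity histogram, the highest-entropy patches are the ones masked, and the mask ratio is fixed. The names `patch_entropy` and `entropy_guided_mask` are hypothetical.

```python
from collections import Counter
from math import log2
from typing import List, Sequence


def patch_entropy(pixels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of one patch's intensity histogram."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def entropy_guided_mask(
    patches: Sequence[Sequence[int]], mask_ratio: float
) -> List[bool]:
    """Return True for patches to mask: the highest-entropy ones (assumed)."""
    num_masked = round(mask_ratio * len(patches))
    order = sorted(
        range(len(patches)),
        key=lambda i: patch_entropy(patches[i]),
        reverse=True,
    )
    masked = set(order[:num_masked])
    return [i in masked for i in range(len(patches))]


# Four hypothetical flattened 2x2 grey-level patches.
patches = [
    [0, 0, 0, 0],        # uniform patch: entropy 0 bits
    [0, 0, 255, 255],    # two equally likely levels: 1 bit
    [0, 64, 128, 255],   # four distinct levels: 2 bits
    [10, 10, 10, 20],    # skewed histogram: about 0.81 bits
]

mask = entropy_guided_mask(patches, mask_ratio=0.5)
print(mask)  # [False, True, True, False]
```

Whether the paper masks high-entropy or low-entropy regions, and whether the ratio changes during pre-training, are exactly the details the comment asks the authors to state.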

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important opportunities to strengthen the statistical validation and analysis of our ensemble approach. We address each major comment below and will revise the manuscript to incorporate additional analyses and details.

Referee: [Abstract and Results section] The central claim that the ensemble outperforms the individual models rests on the unverified assumption that the ImageNet-pretrained and entropy-guided MAE-pretrained ConvNeXt-Tiny models produce sufficiently uncorrelated errors. The manuscript supplies no prediction-correlation analysis, error-overlap statistics, or ablation comparing the ensemble to the stronger single model (see the ensemble description and results tables).

Authors: We agree that explicit evidence of complementarity would strengthen the central claim. The manuscript already shows the ensemble outperforming both individual models on all four datasets, which is consistent with the different pre-training regimes (general-domain vs. domain-specific) yielding complementary features. However, we did not include correlation or error-overlap analysis. In the revision we will add (1) pairwise prediction correlation coefficients between the two models and (2) an ablation table directly comparing ensemble performance to the stronger single model on each dataset. These additions will provide quantitative support for the assumption of sufficiently uncorrelated errors.

Revision: yes

Referee: [Experimental Results] Performance tables lack error bars, statistical significance tests (e.g., paired t-tests or McNemar tests on the reported accuracy/F1 gains), and explicit train/validation/test split details. Without these, the assertions of 'superior performance and robustness' and 'state-of-the-art results' cannot be rigorously evaluated.

Authors: We acknowledge that the current presentation of results can be made more rigorous. The manuscript reports mean metrics across multiple runs but does not display error bars or conduct formal significance testing, and the split details are described at a high level. We will revise the experimental section to (1) include standard deviations or error bars in all performance tables, (2) add McNemar’s test (or paired t-tests where appropriate) to assess statistical significance of the reported gains, and (3) provide explicit, reproducible descriptions of the train/validation/test splits (including random seeds and stratification strategy) for each of the four datasets.

Revision: yes
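
McNemar's test, named in both the referee comment and the response, compares two classifiers on the same test samples using only the cases where exactly one of them is correct. The sketch below implements the exact (binomial) two-sided version with the standard library. The function name `mcnemar_exact` and the toy predictions are hypothetical; this illustrates the test itself, not the authors' planned analysis.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    preds_a: Sequence[int], preds_b: Sequence[int], labels: Sequence[int]
) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts samples only model A gets
    right and c counts samples only model B gets right.
    """
    triples = list(zip(preds_a, preds_b, labels))
    b = sum(1 for pa, pb, y in triples if pa == y and pb != y)
    c = sum(1 for pa, pb, y in triples if pa != y and pb == y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Hypothetical test-set predictions: ensemble versus one single model.
labels = [0, 1, 0, 1, 0, 1, 0, 1]
preds_ensemble = [0, 1, 0, 1, 0, 1, 0, 0]
preds_single = [1, 0, 1, 0, 0, 1, 0, 0]

b, c, p_value = mcnemar_exact(preds_ensemble, preds_single, labels)
print(b, c, p_value)  # 4 0 0.125
```

In this toy example the ensemble wins all four discordant cases, yet the p-value is 0.125: with so few discordant pairs, even a clean sweep is not significant at the 0.05 level, which is why small medical test sets need this kind of check.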

Circularity Check

0 steps flagged

No circularity: purely empirical framework with no derivations or self-referential reductions

The paper presents a practical ensemble method using one ImageNet-pretrained ConvNeXt-Tiny and one entropy-guided MAE-pretrained ConvNeXt-Tiny, fine-tuned on target medical datasets and combined via simple probability averaging. No equations, fitted parameters, or derivation steps are described that reduce to their own inputs by construction. Claims rest on experimental results across external datasets (BUSI, ISIC 2018, Kvasir, COVID) rather than any self-citation chain or ansatz smuggled through prior work. The complementarity assumption is an empirical hypothesis, not a definitional or fitted tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract relies on standard deep-learning assumptions, such as the utility of pre-trained features and the benefit of model ensembles, and introduces no new free parameters or invented entities. Its single axiom is a domain assumption about the complementarity of the two pre-trained models.

axioms (1)

Domain assumption: Pre-trained features from ImageNet and from entropy-guided MAE on medical data are complementary and improve classification when averaged. Stated in the abstract as the basis for the ensemble strategy.


[1] [1]

Yamashita, M

R. Yamashita, M. Nishio, R. K. G. Do, K. Togashi, Convolutional neu- ral networks: an overview and application in radiology, Insights into Imaging 9 (2018) 611–629

work page 2018

[2] [2]

Atasever, N

S. Atasever, N. Azginoglu, D. S. Terzi, R. Terzi, A comprehensive sur- vey of deep learning research on medical image analysis with focus on transfer learning, Clinical Imaging 94 (2023) 18–41

work page 2023

[3] [3]

Jiang, Z

H. Jiang, Z. Diao, T. Shi, Y. Zhou, F. Wang, W. Hu, X. Zhu, S. Luo, G. Tong, Y.-D. Yao, A review of deep learning-based multiple-lesion recognition from medical images: classification, detection and segmen- tation, Computers in Biology and Medicine 157 (2023) 106726

work page 2023

[4] [4]

R. R. Yellu, Y. Kukalakunta, P. Thunki, Medical image analysis- challenges and innovations: Studying challenges and innovations in med- ical image analysis for applications such as diagnosis, treatment plan- ning, and image-guided surgery, Journal of Artificial Intelligence Re- search and Applications 4 (1) (2024) 93–100

work page 2024

[5] [5]

LeCun, B

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hub- bard, L. D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural computation 1 (4) (1989) 541–551

work page 1989

[6] [6]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan, A. Zisserman, Very deep convolutional networks for large- scale image recognition, arXiv preprint arXiv:1409.1556 (2014). 18

work page internal anchor Pith review Pith/arXiv arXiv 2014

[7] [7]

K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE conference on computer vision and pattern recognition (2016) 770–778

work page 2016

[8] [8]

D. G. Lowe, Distinctive image features from scale-invariant keypoints, International journal of computer vision 60 (2) (2004) 91–110

work page 2004

[9] [9]

A. W. Salehi, S. Khan, G. Gupta, B. I. Alabduallah, A. Almjally, H. Al- solai, T. Siddiqui, A. Mellit, A study of cnn and transfer learning in medical imaging: Advantages, challenges, future scope, Sustainability 15 (7) (2023) 5930

work page 2023

[10] [10]

Spolaôr, H

N. Spolaôr, H. D. Lee, A. I. Mendes, C. V. Nogueira, A. R. S. Parmezan, W.S.R.Takaki, F.C.Coy, F.C.Wu, R.Fonseca-Pinto, Fine-tuningpre- trained neural networks for medical image classification in small clinical datasets, Multimedia Tools and Applications 83 (9) (2024) 27305–27329

work page 2024

[11] [11]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010

[12] [12]

J. Guo, K. Han, H. Wu, Y. Tang, X. Chen, Y. Wang, C. Xu, Cmt: Con- volutional neural networks meet vision transformers, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022) 12175–12185

work page 2022

[13] [13]

W. Lin, Z. Wu, J. Chen, J. Huang, L. Jin, Scale-aware modulation meet transformer, Proceedings of the IEEE/CVF International Conference on Computer Vision (2023) 6015–6026

work page 2023

[14] [14]

Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, Pro- ceedings of the IEEE/CVF International Conference on Computer Vi- sion (2021) 10012–10022

work page 2021

[15] [15]

Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A con- vnet for the 2020s, Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (2022) 11976–11986. 19

work page 2022

[16] [16]

Azizi, B

S. Azizi, B. Mustafa, F. Ryan, Z. Beaver, J. Freyberg, A. Deaton, A. Loh, A. Karthikesalingam, S. Kornblith, T. Chen, et al., Big self- supervised models advance medical image classification, Proceedings of the IEEE/CVF International Conference on Computer Vision (2021) 3478–3488

work page 2021

[17] [17]

Nielsen, L

M. Nielsen, L. Wenderoth, T. Sentker, R. Werner, Self-supervision for medical image classification: State-of-the-art performance with˜ 100 la- beled training samples per class, Bioengineering 10 (8) (2023) 895

work page 2023

[18] [18]

K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked au- toencoders are scalable vision learners, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022) 16000– 16009

work page 2022

[19] [19]

B. Chen, D. Martí Monsó, Y. Du, M. Simchowitz, R. Tedrake, V. Sitz- mann, Diffusion forcing: Next-token prediction meets full-sequence dif- fusion, Advances in Neural Information Processing Systems 37 (2024) 24081–24125

work page 2024

[20] [20]

Huang, Z

G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely con- nected convolutional networks, Proceedings of the IEEE conference on computer vision and pattern recognition (2017) 4700–4708

work page 2017

[21] [21]

Jiang, L

Y. Jiang, L. Chen, H. Zhang, X. Xiao, Breast cancer histopathological image classification using convolutional neural networks with small se- resnet module, PloS one 14 (3) (2019) e0214587

work page 2019

[22] [22]

Z. Cai, Y. Chen, J. Wang, X. He, Z. Pei, X. Lei, C. Lu, Dafnet: A novel dynamic adaptive fusion network for medical image classification, Information Fusion 126 (2026) 103507

work page 2026

[23] [23]

Z. Ren, S. Liu, L. Wang, Z. Guo, Conv-sdmlpmixer: A hybrid medical image classification network based on multi-branch cnn and multi-scale multi-dimensional mlp, Information Fusion 118 (2025) 102937

work page 2025

[24] [24]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017). 20

work page 2017

[25] [25]

O. N. Manzari, H. Ahmadabadi, H. Kashiani, S. B. Shokouhi, A. Aya- tollahi, Medvit: a robust vision transformer for generalized medical im- age classification, Computers in Biology and and Medicine 157 (2023) 106791

work page 2023

[26] [26]

O. N. Manzari, H. Asgariandehkordi, T. Koleilat, et al., Medical image classificationwithkan-integratedtransformersanddilatedneighborhood attention, Applied Soft Computing (2025)

work page 2025

[27] [27]

Y. Yue, Z. Li, Medmamba: Vision mamba for medical image classifica- tion, arXiv.org (2024)

work page 2024

[28] [28]

Sevinç, M

A. Sevinç, M. Ucan, B. Kaya, A distillation approach to transformer- based medical image classification with limited data, Diagnostics (2025)

work page 2025

[29]

X. Wu, Y. Feng, H. Xu, Z. Lin, T. Chen, S. Li, S. Qiu, Q. Liu, Y. Ma, S. Zhang, Ctranscnn: Combining transformer and cnn in multilabel medical image classification, Knowledge-Based Systems 281 (2023) 111030

[30]

X. Huo, G. Sun, S. Tian, Y. Wang, L. Yu, J. Long, W. Zhang, A. Li, Hifuse: Hierarchical multi-scale feature fusion network for medical image classification, Biomedical Signal Processing and Control 87 (2024) 105534

[31]

T. Hussain, H. Shouno, A. Hussain, D. Hussain, M. Ismail, T. H. Mir, F. R. Hsu, T. Alam, S. A. Akhy, Effresnet-vit: A fusion-based convolutional and vision transformer model for explainable medical image classification, IEEE Access (2025)

[32]

K. Djoumessi, S. O. Mensah, P. Berens, A hybrid fully convolutional cnn-transformer model for inherently interpretable medical image classification, arXiv.org (2025)

[33]

R. Lu, L. Yu, S. Tian, Y. Xiao, Biologically inspired vision fusion: Central-peripheral synergy for medical image classification, Engineering Applications of Artificial Intelligence 169 (2026) 114026

[34]

X. Kong, X. Zhang, Understanding masked image modeling via learning occlusion invariant feature, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023) 6241–6251

[35]

J. Xu, S. Stirenko, J. Mao, Self-supervised model based on masked autoencoders advance ct scans classification, arXiv preprint arXiv:2210.05073 (2022)

[36]

J. Mao, S. Guo, X. Yin, Y. Chang, B. Nie, Y. Wang, Medical supervised masked autoencoder: Crafting a better masking strategy and efficient fine-tuning schedule for medical image classification, Applied Soft Computing 169 (2025) 112536

[37]

A. Sagar, Pmaf loss: Probabilistic margin-aware focal loss for robust medical image classification, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2026, pp. 344–349

[38]

Y. Yang, H. Fu, A. I. Avilés-Rivero, Z. Xing, L. Zhu, Diffmic-v2: Medical image classification via improved diffusion network, IEEE Transactions on Medical Imaging 44 (5) (2025) 2244–2255

[39]

X. Zhang, Z. Xiao, J. Ma, et al., Adaptive dual-axis style-based recalibration network with class-wise statistics loss for imbalanced medical image classification, IEEE Transactions on Image Processing (2025)

[40]

J. Hu, Y. Xiang, Y. Lin, J. Du, H. Zhang, H. Liu, Multi-scale transformer architecture for accurate medical image classification, in: Proceedings of the 2025 International Conference on Artificial Intelligence and Computational Intelligence, 2025, pp. 409–414

[41]

P. Dehbozorgi, O. Ryabchykov, T. W. Bocklitz, A comparative study of statistical, radiomics, and deep learning feature extraction techniques for medical image classification in optical and radiological modalities, Computers in biology and medicine 187 (2025) 109768

[42]

T. Sakirin, R. B. Said, Application of deep learning and transfer learning techniques for medical image classification, Edraak 2025 (2025) 38–46

[43]

J. Qiu, J. Cao, Y. Huang, Z. Zhu, F. Wang, C. Lu, Y. Li, Y. Zheng, Muscle: A new perspective to multi-scale fusion for medical image classification based on the theory of evidence, IEEE Transactions on Medical Imaging 45 (3) (2026) 893–905

[44]

W. Al-Dhabyani, M. Gomaa, H. Khaled, A. Fahmy, Dataset of breast ultrasound images, Data in brief 28 (2020) 104863

[45]

N. Codella, V. Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman, B. Helba, A. Kalloo, K. Liopyris, M. Marchetti, et al., Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic), arXiv preprint arXiv:1902.03368 (2019)

[46]

K. Pogorelov, K. R. Randel, C. Griwodz, S. L. Eskeland, T. de Lange, D. Johansen, C. Spampinato, D.-T. Dang-Nguyen, M. Lux, P. T. Schmidt, et al., Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection, Proceedings of the 8th ACM on Multimedia Systems Conference (2017) 164–169

[47]

M. E. H. Chowdhury, T. Rahman, A. Khandakar, R. Mazhar, M. A. Kadir, Z. B. Mahbub, K. R. Islam, M. S. Khan, A. Iqbal, N. A. Emadi, et al., Can ai help in screening viral and covid-19 pneumonia?, IEEE Access 8 (2020) 132665–132676

[48]

T. Rahman, A. Khandakar, Y. Qiblawey, A. Tahir, S. Kiranyaz, S. B. A. Kashem, M. T. Islam, S. Al Maadeed, S. M. Zughaier, M. S. Khan, et al., Exploring the effect of image enhancement techniques on covid-19 detection using chest x-ray images, Computers in Biology and Medicine 132 (2021) 104319

[49]

J. Yang, C. Li, X. Dai, J. Gao, Focal modulation networks, Advances in Neural Information Processing Systems 35 (2022) 4203–4217

[50]

J. Min, Y. Zhao, C. Luo, M. Cho, Peripheral vision transformer, Advances in Neural Information Processing Systems 35 (2022) 32097–32111

[51]

S. Ren, X. Yang, S. Liu, X. Wang, Sg-former: Self-guided transformer with evolving token reallocation, Proceedings of the IEEE/CVF International Conference on Computer Vision (2023) 6003–6014

[52]

X. Huo, S. Tian, Y. Yang, L. Yu, W. Zhang, A. Li, Spa: Self-peripheral-attention for central-peripheral interactions in endoscopic image classification and segmentation, Expert Systems with Applications 245 (2024) 123053