Entropy-Guided Self-Supervised Learning for Medical Image Classification
Pith reviewed 2026-05-22 03:13 UTC · model grok-4.3
The pith
An ensemble averaging predictions from ImageNet-pretrained and entropy-guided MAE-pretrained ConvNeXt models improves medical image classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that pre-training one ConvNeXt-Tiny on ImageNet and a second on the medical dataset via entropy-guided masked autoencoding, followed by fine-tuning and simple probability averaging, produces complementary features that yield state-of-the-art classification accuracy and robustness on the BUSI, ISIC 2018, Kvasir, and COVID-19 datasets.
What carries the argument
Ensemble formed by averaging the predicted probabilities of an ImageNet-pretrained ConvNeXt-Tiny and an entropy-guided MAE-pretrained ConvNeXt-Tiny after task-specific fine-tuning.
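The averaging step itself is small enough to sketch. The snippet below is a minimal illustration rather than the authors' code: the function names (`softmax`, `ensemble_average`) and the logit values are hypothetical, and equal weighting is assumed because the abstract describes plain probability averaging.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def ensemble_average(logits_imagenet: np.ndarray, logits_mae: np.ndarray):
    """Average the class probabilities of two fine-tuned models (equal weights)."""
    probs = 0.5 * (softmax(logits_imagenet) + softmax(logits_mae))
    return probs, probs.argmax(axis=1)

# Hypothetical logits for 2 images and 3 classes (illustrative values only).
logits_imagenet = np.array([[2.0, 1.0, 0.0], [0.0, 0.0, 3.0]])
logits_mae = np.array([[0.0, 3.0, 0.0], [0.0, 0.0, 1.0]])

probs, preds = ensemble_average(logits_imagenet, logits_mae)
print(preds)  # ensemble class index per image
```

In this toy case the two models disagree on the first image, and the average follows the more confident one. That is the mechanism by which averaging can help, but only if the models' confident mistakes do not coincide.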
If this is right
- Domain-specific MAE pre-training supplies features that complement the general features from ImageNet.
- Probability averaging produces higher accuracy than either model used alone.
- The method reaches state-of-the-art results on four distinct medical imaging modalities.
- Combining broad and narrow pre-training strategies mitigates limited annotated data and high intra-class variability.
Where Pith is reading between the lines
- The same dual-pretraining pattern could be tried with other backbone families such as Vision Transformers.
- Entropy-guided masking might be tested inside other self-supervised objectives beyond MAE.
- Weighted fusion or learned combination layers could replace simple averaging in future variants.
- The approach may transfer to other data-scarce domains such as satellite or microscopy images.
Load-bearing premise
The two pre-trained models supply complementary features that are effectively combined simply by averaging their predicted probabilities.
What would settle it
If the ensemble fails to outperform both individual models and existing methods on a fifth independent medical imaging dataset, the central claim would be falsified.
Original abstract
Accurate and robust medical image classification is paramount for early disease diagnosis and treatment planning. However, challenges such as limited annotated data, high intra-class variability, and subtle inter-class differences often hinder the performance of deep learning models. This paper introduces a synergistic deep learning framework that leverages the strengths of self-supervised learning and transfer learning for enhanced medical image classification. Our approach employs two distinct ConvNeXt-Tiny models: one pre-trained on a large-scale natural image dataset (ImageNet) and another pre-trained using an entropy-guided Masked Autoencoder (MAE) on the target medical dataset. Both models are then fine-tuned on specific medical image classification tasks. A final ensemble strategy, based on averaging predicted probabilities, is utilized to combine the complementary insights from these two models. Rigorous experimental validation across four diverse medical imaging datasets (Breast Ultrasound Images (BUSI), International Skin Imaging Collaboration (ISIC) 2018, Kvasir, and COVID) demonstrates the superior performance and robustness of our ensemble approach. The MAE pre-training significantly improves feature learning on domain-specific data, while the ImageNet pre-training provides strong generalizable features. The ensemble consistently achieves state-of-the-art results, outperforming individual models and existing methods, highlighting the efficacy of combining diverse pre-training strategies for challenging medical image analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a framework for medical image classification that combines two ConvNeXt-Tiny models: one pre-trained on ImageNet and one pre-trained via entropy-guided Masked Autoencoder (MAE) on the target medical dataset. Both are fine-tuned on the downstream tasks, and their predicted probabilities are averaged to form an ensemble. The authors claim this yields superior performance and robustness, achieving state-of-the-art results on the BUSI, ISIC 2018, Kvasir, and COVID datasets.
Significance. If the reported gains are statistically significant and arise from genuinely complementary features rather than correlated errors, the approach offers a practical recipe for blending general-domain and domain-specific pre-training in label-scarce medical imaging settings. The entropy-guided MAE component is a concrete technical choice that could be adopted more broadly.
Major comments (2)
- [Abstract and Results section] The central claim that the ensemble outperforms the individual models rests on the unverified assumption that the ImageNet-pretrained and entropy-guided MAE-pretrained ConvNeXt-Tiny models produce sufficiently uncorrelated errors. The manuscript supplies no prediction-correlation analysis, error-overlap statistics, or ablation comparing the ensemble to the stronger single model (see the ensemble description and results tables).
- [Experimental Results] Performance tables lack error bars, statistical significance tests (e.g., paired t-tests or McNemar tests on the reported accuracy/F1 gains), and explicit train/validation/test split details. Without these, the assertions of 'superior performance and robustness' and 'state-of-the-art results' cannot be rigorously evaluated.
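The error-overlap statistics requested in the first major comment could be computed along the following lines. This is a sketch under stated assumptions: `error_overlap` and its output keys are hypothetical names, the labels and predictions are made-up toy values, and the Jaccard index of the two error sets is only one of several reasonable overlap measures.

```python
import numpy as np

def error_overlap(y_true, pred_a, pred_b):
    """Summarise how often two classifiers err on the same test samples."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    err_a = pred_a != y_true
    err_b = pred_b != y_true
    both = int(np.sum(err_a & err_b))
    either = int(np.sum(err_a | err_b))
    return {
        "only_a_wrong": int(np.sum(err_a & ~err_b)),
        "only_b_wrong": int(np.sum(~err_a & err_b)),
        "both_wrong": both,
        # Jaccard index of the error sets: 1.0 means identical mistakes.
        "jaccard": both / either if either else 0.0,
        # Fraction of samples on which the two predictions differ.
        "disagreement": float(np.mean(pred_a != pred_b)),
    }

# Toy labels and predictions (illustrative values only).
y_true = [0, 1, 2, 1, 0, 2]
pred_a = [0, 1, 1, 1, 0, 0]
pred_b = [0, 2, 2, 1, 0, 0]
stats = error_overlap(y_true, pred_a, pred_b)
print(stats)
```

A low Jaccard index together with non-trivial disagreement would support the complementarity claim; a value near 1 would suggest the two models fail on the same images and that averaging has little to correct.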
Minor comments (2)
- [Method] Clarify the precise entropy computation and masking schedule used in the MAE pre-training; a short pseudocode or equation would remove ambiguity.
- [Experimental Setup] Add references for the four datasets (BUSI, ISIC 2018, Kvasir, COVID) in the experimental setup if they are currently only named.
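On the first minor comment: the abstract does not say how entropy is computed or how it drives the masking. Purely to illustrate the kind of pseudocode being requested, one plausible reading is sketched below. Every choice here is an assumption and not the authors' method: the histogram-based Shannon entropy, the function names `patch_entropy` and `entropy_guided_mask`, the 16-pixel patches, the 32 bins, and sampling masked patches with probability proportional to entropy.

```python
import numpy as np

def patch_entropy(image: np.ndarray, patch: int = 16, bins: int = 32) -> np.ndarray:
    """Shannon entropy (bits) of the intensity histogram of each non-overlapping patch.

    `image` is a 2-D grayscale array with values in [0, 1].
    """
    gh, gw = image.shape[0] // patch, image.shape[1] // patch
    ent = np.zeros(gh * gw)
    for i in range(gh):
        for j in range(gw):
            tile = image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            hist, _ = np.histogram(tile, bins=bins, range=(0.0, 1.0))
            p = hist[hist > 0] / hist.sum()
            ent[i * gw + j] = -np.sum(p * np.log2(p))
    return ent

def entropy_guided_mask(image, mask_ratio=0.6, patch=16, rng=None) -> np.ndarray:
    """ASSUMED scheme: mask patches with probability proportional to their entropy."""
    rng = np.random.default_rng() if rng is None else rng
    ent = patch_entropy(image, patch)
    weights = ent + 1e-6  # keep every patch selectable, even flat ones
    n_mask = int(round(mask_ratio * ent.size))
    idx = rng.choice(ent.size, size=n_mask, replace=False, p=weights / weights.sum())
    mask = np.zeros(ent.size, dtype=bool)
    mask[idx] = True
    return mask

# Synthetic 64x64 image: flat top half, noisy bottom half (illustrative only).
rng = np.random.default_rng(0)
image = rng.random((64, 64))
image[:32, :] = 0.5
ent = patch_entropy(image)
mask = entropy_guided_mask(image, mask_ratio=0.5, rng=rng)
print(ent.round(2), mask.sum())
```

Whether the paper masks high-entropy patches, low-entropy patches, or uses a schedule that changes during training cannot be inferred from the abstract, which is exactly the ambiguity the comment asks the authors to resolve.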
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important opportunities to strengthen the statistical validation and analysis of our ensemble approach. We address each major comment below and will revise the manuscript to incorporate additional analyses and details.
- Referee: [Abstract and Results section] The central claim that the ensemble outperforms the individual models rests on the unverified assumption that the ImageNet-pretrained and entropy-guided MAE-pretrained ConvNeXt-Tiny models produce sufficiently uncorrelated errors. The manuscript supplies no prediction-correlation analysis, error-overlap statistics, or ablation comparing the ensemble to the stronger single model (see the ensemble description and results tables).
  Authors: We agree that explicit evidence of complementarity would strengthen the central claim. The manuscript already shows the ensemble outperforming both individual models on all four datasets, which is consistent with the different pre-training regimes (general-domain vs. domain-specific) yielding complementary features. However, we did not include correlation or error-overlap analysis. In the revision we will add (1) pairwise prediction correlation coefficients between the two models and (2) an ablation table directly comparing ensemble performance to the stronger single model on each dataset. These additions will provide quantitative support for the assumption of sufficiently uncorrelated errors.
  Revision: yes
- Referee: [Experimental Results] Performance tables lack error bars, statistical significance tests (e.g., paired t-tests or McNemar tests on the reported accuracy/F1 gains), and explicit train/validation/test split details. Without these, the assertions of 'superior performance and robustness' and 'state-of-the-art results' cannot be rigorously evaluated.
  Authors: We acknowledge that the current presentation of results can be made more rigorous. The manuscript reports mean metrics across multiple runs but does not display error bars or conduct formal significance testing, and the split details are described at a high level. We will revise the experimental section to (1) include standard deviations or error bars in all performance tables, (2) add McNemar’s test (or paired t-tests where appropriate) to assess statistical significance of the reported gains, and (3) provide explicit, reproducible descriptions of the train/validation/test splits (including random seeds and stratification strategy) for each of the four datasets.
  Revision: yes
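The McNemar test named in this exchange compares two classifiers on the same test set using only the samples where exactly one of them is correct. A minimal sketch of the exact (binomial) form follows, using only the standard library. The function name `mcnemar_exact` and the toy predictions are hypothetical, and nothing here reflects the paper's actual results.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    b = samples where A is right and B is wrong,
    c = samples where A is wrong and B is right.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(pa == y and pb != y for y, pa, pb in triples)
    c = sum(pa != y and pb == y for y, pa, pb in triples)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the two models never disagree in correctness
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)

# Toy example: a single model (A) versus the ensemble (B) on 8 samples.
y_true = [1] * 8
pred_single = [0] * 6 + [1] * 2
pred_ensemble = [1] * 8

b, c, p = mcnemar_exact(y_true, pred_single, pred_ensemble)
print(b, c, p)
```

The exact form is used here because the discordant counts on small medical test sets are often too low for the chi-square approximation to be reliable.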
Circularity Check
No circularity: purely empirical framework with no derivations or self-referential reductions
The paper presents a practical ensemble method using one ImageNet-pretrained ConvNeXt-Tiny and one entropy-guided MAE-pretrained ConvNeXt-Tiny, fine-tuned on target medical datasets and combined via simple probability averaging. No equations, fitted parameters, or derivation steps are described that reduce to their own inputs by construction. Claims rest on experimental results across external datasets (BUSI, ISIC 2018, Kvasir, COVID) rather than any self-citation chain or ansatz smuggled through prior work. The complementarity assumption is an empirical hypothesis, not a definitional or fitted tautology.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: pre-trained features from ImageNet and from entropy-guided MAE on medical data are complementary and improve classification when averaged.