
arxiv: 2605.21970 · v1 · pith:BI4M5YDSnew · submitted 2026-05-21 · 📡 eess.IV · cs.CV

Entropy-Guided Self-Supervised Learning for Medical Image Classification

Pith reviewed 2026-05-22 03:13 UTC · model grok-4.3

classification 📡 eess.IV cs.CV
keywords self-supervised learning · medical image classification · masked autoencoder · ensemble learning · transfer learning · ConvNeXt · entropy-guided pretraining

The pith

An ensemble averaging predictions from ImageNet-pretrained and entropy-guided MAE-pretrained ConvNeXt models improves medical image classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that pairs two ConvNeXt-Tiny models to handle scarce labels and subtle class differences in medical images. One model draws general features from ImageNet pre-training while the other learns domain-specific representations through entropy-guided masked autoencoding on the target medical data. Both are fine-tuned on the classification tasks, then their probability outputs are averaged to form the final prediction. Experiments on breast ultrasound, skin lesion, gastrointestinal, and COVID datasets show the combined system exceeds the accuracy of either model alone and of prior methods.

Core claim

The paper claims that pre-training one ConvNeXt-Tiny on ImageNet and a second on the medical dataset via entropy-guided masked autoencoding, followed by fine-tuning and simple probability averaging, produces complementary features that yield state-of-the-art classification accuracy and robustness on the BUSI, ISIC 2018, Kvasir, and COVID-19 datasets.

What carries the argument

Ensemble formed by averaging the predicted probabilities of an ImageNet-pretrained ConvNeXt-Tiny and an entropy-guided MAE-pretrained ConvNeXt-Tiny after task-specific fine-tuning.
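
The fusion step can be illustrated with a minimal sketch. It assumes each fine-tuned model already outputs one softmax probability vector per image; the function names and the probability values below are hypothetical and are not taken from the paper. Plain Python lists keep the example dependency-free; in practice the same arithmetic would run on framework tensors.

```python
def average_probabilities(probs_a, probs_b):
    """Average two per-sample class-probability matrices element-wise."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same samples")
    return [
        [(pa + pb) / 2.0 for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(probs_a, probs_b)
    ]


def predict(probs):
    """Return the arg-max class index for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Hypothetical softmax outputs for 3 samples and 3 classes (illustrative only).
imagenet_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
mae_probs = [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]

ensemble_probs = average_probabilities(imagenet_probs, mae_probs)
ensemble_pred = predict(ensemble_probs)  # [1, 1, 2]
```

On the first sample the two models disagree (class 0 versus class 1), and the average follows the more confident one. That is the only mechanism by which simple averaging can change a prediction.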

If this is right

  • Domain-specific MAE pre-training supplies features that complement the general features from ImageNet.
  • Probability averaging produces higher accuracy than either model used alone.
  • The method reaches state-of-the-art results on four distinct medical imaging modalities.
  • Combining broad and narrow pre-training strategies mitigates limited annotated data and high intra-class variability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual-pretraining pattern could be tried with other backbone families such as Vision Transformers.
  • Entropy-guided masking might be tested inside other self-supervised objectives beyond MAE.
  • Weighted fusion or learned combination layers could replace simple averaging in future variants.
  • The approach may transfer to other data-scarce domains such as satellite or microscopy images.
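
The weighted-fusion extension listed above could be prototyped as follows. This is a sketch under stated assumptions, not something the paper describes: the helper names, the grid search over a single mixing weight, and the validation probabilities are all hypothetical.

```python
def fuse(probs_a, probs_b, weight_a):
    """Convex combination: weight_a * probs_a + (1 - weight_a) * probs_b."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return [
        [weight_a * pa + (1.0 - weight_a) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(probs_a, probs_b)
    ]


def accuracy(probs, labels):
    """Fraction of samples whose arg-max class matches the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in probs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def pick_weight(val_probs_a, val_probs_b, val_labels, steps=10):
    """Grid-search the mixing weight on a validation set.

    Ties in accuracy are broken in favour of the weight closest to 0.5,
    so plain averaging is kept unless another weight is strictly better.
    """
    candidates = [i / steps for i in range(steps + 1)]
    return max(
        candidates,
        key=lambda w: (
            accuracy(fuse(val_probs_a, val_probs_b, w), val_labels),
            -abs(w - 0.5),
        ),
    )


# Hypothetical validation outputs for 4 samples and 2 classes.
val_a = [[0.9, 0.1], [0.6, 0.4], [0.4, 0.6], [0.3, 0.7]]
val_b = [[0.4, 0.6], [0.2, 0.8], [0.1, 0.9], [0.6, 0.4]]
val_labels = [0, 1, 1, 1]

best_weight = pick_weight(val_a, val_b, val_labels)
fused_accuracy = accuracy(fuse(val_a, val_b, best_weight), val_labels)
```

The weight must be chosen on held-out validation data rather than the test set; otherwise any gain over simple averaging would be optimistic.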

Load-bearing premise

The two pre-trained models supply complementary features that are effectively combined simply by averaging their predicted probabilities.

What would settle it

If the ensemble fails to outperform both individual models and existing methods on a fifth independent medical imaging dataset, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.21970 by Joao Florindo, Viviane Moura.

Figure 1: Flow diagram of the proposed methodology: Synergy between the self-supervised …
Figure 2: Accuracy evolution of the Ensemble model varying the proportion of epochs …
Original abstract

Accurate and robust medical image classification is paramount for early disease diagnosis and treatment planning. However, challenges such as limited annotated data, high intra-class variability, and subtle inter-class differences often hinder the performance of deep learning models. This paper introduces a synergistic deep learning framework that leverages the strengths of self-supervised learning and transfer learning for enhanced medical image classification. Our approach employs two distinct ConvNeXt-Tiny models: one pre-trained on a large-scale natural image dataset (ImageNet) and another pre-trained using an entropy-guided Masked Autoencoder (MAE) on the target medical dataset. Both models are then fine-tuned on specific medical image classification tasks. A final ensemble strategy, based on averaging predicted probabilities, is utilized to combine the complementary insights from these two models. Rigorous experimental validation across four diverse medical imaging datasets (Breast Ultrasound Images (BUSI), International Skin Imaging Collaboration (ISIC) 2018, Kvasir, and COVID) demonstrates the superior performance and robustness of our ensemble approach. The MAE pre-training significantly improves feature learning on domain-specific data, while the ImageNet pre-training provides strong generalizable features. The ensemble consistently achieves state-of-the-art results, outperforming individual models and existing methods, highlighting the efficacy of combining diverse pre-training strategies for challenging medical image analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a framework for medical image classification that combines two ConvNeXt-Tiny models: one pre-trained on ImageNet and one pre-trained via entropy-guided Masked Autoencoder (MAE) on the target medical dataset. Both are fine-tuned on the downstream tasks, and their predicted probabilities are averaged to form an ensemble. The authors claim this yields superior performance and robustness, achieving state-of-the-art results on the BUSI, ISIC 2018, Kvasir, and COVID datasets.

Significance. If the reported gains are statistically significant and arise from genuinely complementary features rather than correlated errors, the approach offers a practical recipe for blending general-domain and domain-specific pre-training in label-scarce medical imaging settings. The entropy-guided MAE component is a concrete technical choice that could be adopted more broadly.

major comments (2)
  1. [Abstract and Results section] The central claim that the ensemble outperforms the individual models rests on the unverified assumption that the ImageNet-pretrained and entropy-guided MAE-pretrained ConvNeXt-Tiny models produce sufficiently uncorrelated errors. The manuscript supplies no prediction-correlation analysis, error-overlap statistics, or ablation comparing the ensemble to the stronger single model (see the ensemble description and results tables).
  2. [Experimental Results] Performance tables lack error bars, statistical significance tests (e.g., paired t-tests or McNemar tests on the reported accuracy/F1 gains), and explicit train/validation/test split details. Without these, the assertions of 'superior performance and robustness' and 'state-of-the-art results' cannot be rigorously evaluated.
minor comments (2)
  1. [Method] Clarify the precise entropy computation and masking schedule used in the MAE pre-training; a short pseudocode or equation would remove ambiguity.
  2. [Experimental Setup] Add references for the four datasets (BUSI, ISIC 2018, Kvasir, COVID) in the experimental setup if they are currently only named.
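
To make minor comment 1 concrete, here is one possible reading of "entropy-guided" masking, offered only as an illustration. The review does not state the paper's entropy definition or masking schedule, so everything below is an assumption: entropy is taken as the Shannon entropy of each patch's intensity histogram, and the highest-entropy patches are the ones masked. The function names and the toy image are hypothetical.

```python
import math
from collections import Counter


def patch_entropy(pixels):
    """Shannon entropy in bits of the intensity histogram of one patch."""
    total = len(pixels)
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(pixels).values()
    )


def entropy_mask(image, patch_size, mask_ratio):
    """Return (entropies, masked_patch_indices) for a 2-D grayscale image.

    Patches are non-overlapping and indexed in row-major order.
    ASSUMPTION: the highest-entropy patches are masked; the paper may use
    a different rule (for example, sampling or a schedule over epochs).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    entropies = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            pixels = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            entropies.append(patch_entropy(pixels))
    num_masked = int(round(mask_ratio * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return entropies, sorted(ranked[:num_masked])


# Toy 4x4 image split into four 2x2 patches (values are illustrative).
toy_image = [
    [0, 0, 0, 255],
    [0, 0, 255, 0],
    [10, 10, 1, 2],
    [10, 10, 3, 4],
]
entropies, masked = entropy_mask(toy_image, patch_size=2, mask_ratio=0.5)
# entropies == [0.0, 1.0, 0.0, 2.0]; masked == [1, 3]
```

The two flat patches carry zero entropy and stay visible, while the two textured patches are masked. An equation or pseudocode at this level of detail in the manuscript would remove the ambiguity the referee points to.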

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important opportunities to strengthen the statistical validation and analysis of our ensemble approach. We address each major comment below and will revise the manuscript to incorporate additional analyses and details.

Point-by-point responses
  1. Referee: [Abstract and Results section] The central claim that the ensemble outperforms the individual models rests on the unverified assumption that the ImageNet-pretrained and entropy-guided MAE-pretrained ConvNeXt-Tiny models produce sufficiently uncorrelated errors. The manuscript supplies no prediction-correlation analysis, error-overlap statistics, or ablation comparing the ensemble to the stronger single model (see the ensemble description and results tables).

    Authors: We agree that explicit evidence of complementarity would strengthen the central claim. The manuscript already shows the ensemble outperforming both individual models on all four datasets, which is consistent with the different pre-training regimes (general-domain vs. domain-specific) yielding complementary features. However, we did not include correlation or error-overlap analysis. In the revision we will add (1) pairwise prediction correlation coefficients between the two models and (2) an ablation table directly comparing ensemble performance to the stronger single model on each dataset. These additions will provide quantitative support for the assumption of sufficiently uncorrelated errors.
    Revision: yes

  2. Referee: [Experimental Results] Performance tables lack error bars, statistical significance tests (e.g., paired t-tests or McNemar tests on the reported accuracy/F1 gains), and explicit train/validation/test split details. Without these, the assertions of 'superior performance and robustness' and 'state-of-the-art results' cannot be rigorously evaluated.

    Authors: We acknowledge that the current presentation of results can be made more rigorous. The manuscript reports mean metrics across multiple runs but does not display error bars or conduct formal significance testing, and the split details are described at a high level. We will revise the experimental section to (1) include standard deviations or error bars in all performance tables, (2) add McNemar’s test (or paired t-tests where appropriate) to assess statistical significance of the reported gains, and (3) provide explicit, reproducible descriptions of the train/validation/test splits (including random seeds and stratification strategy) for each of the four datasets.
    Revision: yes
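
The analyses promised in the two responses above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names and the toy labels and predictions are hypothetical, the p-value uses the standard exact (binomial) form of McNemar's test, and Jaccard overlap of error sets stands in for the unspecified "error-overlap statistics".

```python
from math import comb


def discordant_counts(labels, preds_a, preds_b):
    """Count samples where exactly one of the two classifiers is correct."""
    only_a = only_b = 0
    for y, a, b in zip(labels, preds_a, preds_b):
        if a == y and b != y:
            only_a += 1
        elif a != y and b == y:
            only_b += 1
    return only_a, only_b


def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value from the discordant counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # the classifiers never disagree in correctness
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def error_overlap(labels, preds_a, preds_b):
    """Jaccard overlap of the two error sets (1.0 means identical errors)."""
    errs_a = {i for i, (y, p) in enumerate(zip(labels, preds_a)) if p != y}
    errs_b = {i for i, (y, p) in enumerate(zip(labels, preds_b)) if p != y}
    union = errs_a | errs_b
    return len(errs_a & errs_b) / len(union) if union else 0.0


# Toy test set of 10 samples (illustrative values only).
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
preds_a = [0, 1, 0, 1, 0, 1, 0, 1, 0, 0]  # e.g. the ensemble: 1 error
preds_b = [0, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # e.g. a single model: 4 errors

only_a, only_b = discordant_counts(labels, preds_a, preds_b)  # (3, 0)
p_value = mcnemar_exact(only_a, only_b)                       # 0.25
overlap = error_overlap(labels, preds_a, preds_b)             # 0.25
```

In this toy case model A makes three fewer errors than model B, yet the exact p-value is 0.25, so the gain is not statistically significant. This is why the referee asks for significance tests next to the raw accuracy differences.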

Circularity Check

0 steps flagged

No circularity: purely empirical framework with no derivations or self-referential reductions

full rationale

The paper presents a practical ensemble method using one ImageNet-pretrained ConvNeXt-Tiny and one entropy-guided MAE-pretrained ConvNeXt-Tiny, fine-tuned on target medical datasets and combined via simple probability averaging. No equations, fitted parameters, or derivation steps are described that reduce to their own inputs by construction. Claims rest on experimental results across external datasets (BUSI, ISIC 2018, Kvasir, COVID) rather than any self-citation chain or ansatz smuggled through prior work. The complementarity assumption is an empirical hypothesis, not a definitional or fitted tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract relies on standard deep-learning assumptions, such as the utility of pre-trained features and the benefit of model ensembles, and introduces no free parameters or invented entities. Its single domain assumption is listed below.

axioms (1)
  • Domain assumption: pre-trained features from ImageNet and from entropy-guided MAE on medical data are complementary and improve classification when averaged.
    Stated in the abstract as the basis for the ensemble strategy.

pith-pipeline@v0.9.0 · 5759 in / 1302 out tokens · 41469 ms · 2026-05-22T03:13:44.635621+00:00 · methodology

