arxiv: 2604.17360 · v1 · submitted 2026-04-19 · 💻 cs.AI

T-DuMpRa: Teacher-guided Dual-path Multi-prototype Retrieval Augmented framework for fine-grained medical image classification

Pith reviewed 2026-05-10 06:22 UTC · model grok-4.3

classification 💻 cs.AI
keywords fine-grained medical image classification · multi-prototype retrieval · teacher-guided learning · confidence-gated fusion · skin lesion classification · ambiguous cases · EMA teacher · contrastive embedding learning

The pith

A teacher-guided dual-path framework with multi-prototype retrieval and confidence-gated fusion improves accuracy on visually ambiguous medical images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops T-DuMpRa to address fine-grained medical image classification, where subtle inter-class differences create visually ambiguous cases that produce uncertain predictions. It combines a standard discriminative classifier with a parallel retrieval path that matches embeddings against a bank of prototypes derived from clustered teacher-model representations. Training uses both cross-entropy and supervised contrastive losses to produce cosine-compatible embeddings, while an EMA teacher supplies stable representations for the memory bank. At inference, a conservative gate fuses the two signals only when the classifier is uncertain and the prototype evidence is both decisive and in conflict with the classifier's prediction; high-confidence outputs are left unchanged. Experiments on HAM10000 and ISIC2019 report modest gains across five backbones, and visualizations indicate better separation of ambiguous examples.

Core claim

The T-DuMpRa framework jointly optimizes discriminative classification and multi-prototype retrieval during training by using an EMA teacher to build a clustered memory bank in embedding space, then at inference fuses the classifier distribution with cosine similarity to the prototypes through a conservative confidence gate that activates retrieval solely when the base prediction is uncertain and the retrieval evidence is decisive and conflicting.

What carries the argument

The confidence-gated fusion mechanism that selectively combines the base classifier output with cosine similarity scores to a multi-prototype memory bank constructed from EMA teacher embeddings, activating only on uncertain and conflicting cases.
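
The gate described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the thresholds `tau_cls` and `tau_ret`, the mixing weight `alpha`, the softmax temperature, and the choice to score each class by its best-matching prototype are all assumptions, since the abstract does not give the exact rule or values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retrieval_distribution(embedding, bank, temperature=0.1):
    """bank[c] is the list of prototype vectors for class c."""
    # Assumption: a class is scored by its best-matching prototype.
    scores = [max(cosine(embedding, p) for p in protos) for protos in bank]
    return softmax(scores, temperature)

def gated_fusion(p_cls, p_ret, tau_cls=0.6, tau_ret=0.8, alpha=0.5):
    """Fuse only when the classifier is uncertain AND retrieval is
    decisive AND the two paths disagree; otherwise return p_cls unchanged.
    All three hyperparameters are hypothetical placeholders."""
    top_cls = max(range(len(p_cls)), key=p_cls.__getitem__)
    top_ret = max(range(len(p_ret)), key=p_ret.__getitem__)
    uncertain = p_cls[top_cls] < tau_cls
    decisive = p_ret[top_ret] >= tau_ret
    conflicting = top_cls != top_ret
    if uncertain and decisive and conflicting:
        fused = [(1 - alpha) * c + alpha * r for c, r in zip(p_cls, p_ret)]
        return fused, True
    return list(p_cls), False

# Toy usage: two classes, class 1 has two prototypes.
bank = [[[1.0, 0.0]], [[0.0, 1.0], [0.6, 0.8]]]
p_ret = retrieval_distribution([0.1, 0.9], bank)
fused, fired = gated_fusion([0.55, 0.45], p_ret)
```

Returning the untouched classifier distribution in the default branch is what makes the gate conservative: a confident prediction can never be altered by the retrieval path.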

If this is right

  • The framework can be attached to any existing backbone by adding a compact prototype bank without retraining the original model from scratch.
  • Joint cross-entropy and contrastive training produces embeddings that support both classification and reliable prototype matching.
  • The EMA teacher supplies smoother representations that enable stable clustering into multiple prototypes per class.
  • The conservative gate leaves confident correct predictions untouched while targeting only the ambiguous subset.
  • Visualization of activation patterns confirms the method focuses retrieval on visually similar inter-class examples.
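
The EMA teacher and the multi-prototype bank mentioned in the list above can be illustrated as follows. This is a rough sketch under stated assumptions: the decay value, the number of prototypes `k`, the deterministic initialization, and the use of a simple cosine-based k-means are placeholders, because the abstract does not specify the clustering algorithm or these hyperparameters. The helper names are hypothetical.

```python
import math

def ema_update(teacher, student, decay=0.99):
    """One EMA step on flat parameter lists: t <- decay*t + (1-decay)*s."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def normalize(v):
    """Scale a vector to unit L2 norm (returned unchanged if it is zero)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def kmeans_prototypes(embeddings, k, iters=20):
    """Cluster unit-normalized embeddings by cosine similarity.
    Assumption: the first k points seed the centers (deterministic)."""
    points = [normalize(e) for e in embeddings]
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            sims = [sum(a * b for a, b in zip(p, c)) for c in centers]
            groups[sims.index(max(sims))].append(p)
        new_centers = []
        for c, g in zip(centers, groups):
            if g:
                mean = [sum(col) / len(g) for col in zip(*g)]
                new_centers.append(normalize(mean))
            else:
                new_centers.append(c)  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def build_prototype_bank(embeddings_by_class, k):
    """Map each class to at most k prototypes from its teacher embeddings."""
    return {c: kmeans_prototypes(e, min(k, len(e)))
            for c, e in embeddings_by_class.items()}

# Toy usage with hand-made 2-D "teacher embeddings".
bank = build_prototype_bank(
    {0: [[1, 0], [0, 1], [0.9, 0.1], [0.1, 0.9]], 1: [[-1.0, 0.0]]}, k=2)
```

Keeping several prototypes per class lets a class with visually distinct sub-groups be matched by whichever sub-group is closest, rather than by a single averaged center.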

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The selective activation logic could be tested on other fine-grained domains such as plant species or product variants where uncertainty also signals visual overlap.
  • Replacing the fixed prototype bank with an online-updating version might allow the method to adapt to distribution shift without full retraining.
  • Varying the uncertainty and conflict thresholds per dataset could reveal whether the reported gains are conservative or near-optimal.
  • The dual-path training might be extended by adding a third path that learns to predict when retrieval will be helpful, turning the gate into a learned component.

Load-bearing premise

The gated fusion will activate retrieval exactly when it resolves ambiguity without introducing errors on predictions that are already correct but uncertain.

What would settle it

On the HAM10000 or ISIC2019 test sets, identify the subset of cases where the base classifier is uncertain yet correct, apply the fusion unconditionally, and check whether accuracy falls relative to the base classifier alone.
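
The check proposed above could be scripted roughly as follows. The sketch assumes per-sample classifier and retrieval distributions are already available as plain lists, and it uses a hypothetical uncertainty threshold `tau_cls` and a hypothetical equal-weight fusion (`alpha=0.5`); neither value comes from the paper.

```python
def argmax(p):
    """Index of the largest entry in a list."""
    return max(range(len(p)), key=p.__getitem__)

def retained_after_unconditional_fusion(p_cls_list, p_ret_list, labels,
                                        tau_cls=0.6, alpha=0.5):
    """Select samples where the classifier is uncertain yet correct, fuse
    unconditionally, and return (indices, fraction still correct).
    Base accuracy on this subset is 1.0 by construction, so any value
    below 1.0 measures harm done by ungated fusion."""
    subset = [i for i, (p, y) in enumerate(zip(p_cls_list, labels))
              if max(p) < tau_cls and argmax(p) == y]
    if not subset:
        return subset, None
    kept = 0
    for i in subset:
        fused = [(1 - alpha) * c + alpha * r
                 for c, r in zip(p_cls_list[i], p_ret_list[i])]
        kept += int(argmax(fused) == labels[i])
    return subset, kept / len(subset)

# Toy data: four samples, true label 0 for all of them.
p_cls = [[0.5, 0.4, 0.1], [0.55, 0.35, 0.10],
         [0.9, 0.05, 0.05], [0.4, 0.5, 0.1]]
p_ret = [[0.1, 0.8, 0.1], [0.6, 0.3, 0.1],
         [0.1, 0.8, 0.1], [0.2, 0.7, 0.1]]
subset, retained = retained_after_unconditional_fusion(p_cls, p_ret, [0, 0, 0, 0])
# subset == [0, 1]; retained == 0.5 (sample 0 is flipped to class 1)
```

On real data, comparing this retained fraction with the same quantity under the paper's gate would show how much of the protection comes from the conflict and decisiveness conditions.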

Figures

Figures reproduced from arXiv: 2604.17360 by Shen Zhao, Zixuan Tang.

Figure 1: The challenges in fine-grained medical image classification and our method’s overview. (a) shows visually ambiguous cases where different categories share similar patterns, leading to classifier uncertainty. (b) highlights intra-class diversity, demonstrating the challenge of handling different appearances within the same category. (c) illustrates the shortcomings of the single-path framework, where predi…
Figure 2: The proposed teacher-guided prototype retrieval framework. (a) …
Figure 3: Results of ablation experiments for classifier confidence threshold …
Figure 4: Qualitative examples visualization. We visualized the results on HAM using the experimentally optimal hyperparameter setting with the ViT-B model. In this evaluation, we randomly selected four samples for analysis. … fused prediction increases the BCC confidence, improving the overall accuracy. This shows that our gating mechanism effectively incorporates prototype retrieval when the classifier is uncertain…
Original abstract

Fine-grained medical image classification is challenged by subtle inter-class variations and visually ambiguous cases, where confidence estimates often exhibit uncertainty rather than being overconfident. In such scenarios, purely discriminative classifiers may achieve high overall accuracy yet still fail to distinguish between highly similar categories, leading to miscalibrated predictions. We propose T-DuMpRa, a teacher-guided dual-path multi-prototype retrieval-augmented framework, where discriminative classification and multi-prototype retrieval jointly drive both training and prediction. During training, we jointly optimize cross-entropy and supervised contrastive objectives to learn a cosine-compatible embedding geometry for reliable prototype matching. We further employ an exponential moving average (EMA) teacher to obtain smoother representations and build a multi-prototype memory bank by clustering teacher embeddings in the teacher embedding space. Our framework is plug-and-play: it can be easily integrated into existing classification models by constructing a compact prototype bank, thereby improving performance on visually ambiguous cases. At inference, we combine the classifier's predicted distribution with a similarity-based distribution computed via cosine matching to prototypes, and apply a conservative confidence-gated fusion that activates retrieval only when the classifier's prediction is uncertain and the retrieval evidence is decisive and conflicting, otherwise keeping confident predictions unchanged. On HAM10000 and ISIC2019, our method yields 0.68%-0.21% and 0.44%-2.69% improvements on 5 different backbones. And visualization analysis proves our model can enhance the model's ability to handle visually ambiguous cases.
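
The joint training objective mentioned in the abstract (cross-entropy plus a supervised contrastive term) can be sketched as below. This follows the commonly used supervised contrastive formulation, in which each anchor is pulled toward same-label samples and normalized against all other samples in the batch. The weighting factor `lam`, the temperature, and the function names are assumptions for illustration; the abstract does not state how the two terms are balanced.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class (clamped for stability)."""
    return -math.log(max(probs[label], 1e-12))

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch.
    Assumes embeddings are L2-normalized, so dot product equals cosine."""
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without a positive contribute nothing
        logits = {j: sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
                     / temperature
                  for j in range(n) if j != i}
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

def joint_loss(probs_batch, embeddings, labels, lam=1.0):
    """Mean cross-entropy plus lam times the contrastive term.
    lam is a hypothetical weighting, not a value from the paper."""
    ce = sum(cross_entropy(p, y) for p, y in zip(probs_batch, labels)) / len(labels)
    return ce + lam * supcon_loss(embeddings, labels, temperature=0.1)

# Toy usage: embeddings that agree with the labels give a small contrastive
# loss; the same embeddings with mismatched labels give a large one.
z = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aligned = supcon_loss(z, [0, 0, 1, 1])
mixed = supcon_loss(z, [0, 1, 0, 1])
```

Because the contrastive term operates on normalized embeddings, minimizing it shapes the same cosine geometry that prototype matching relies on at inference.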

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces T-DuMpRa, a teacher-guided dual-path multi-prototype retrieval-augmented framework for fine-grained medical image classification. It jointly trains a classifier with cross-entropy and supervised contrastive losses, uses an EMA teacher to build a multi-prototype memory bank from clustered embeddings, and at inference fuses the classifier's distribution with a prototype similarity distribution using a conservative confidence gate that only activates retrieval for uncertain and conflicting cases. The authors claim small but consistent improvements on HAM10000 (0.21-0.68%) and ISIC2019 (0.44-2.69%) across five backbones, with visualizations suggesting better handling of ambiguous cases.

Significance. If validated, this work could offer a lightweight, plug-and-play method to boost performance of standard backbones on medical datasets with high visual similarity between classes. The conservative gating strategy is a positive aspect to prevent degradation on easy cases. The gains are modest, so the significance would be in providing a practical tool rather than a breakthrough in accuracy.

major comments (3)
  1. Abstract: The reported performance improvements are given as ranges without specifying per-backbone results, statistical significance, or number of runs, which is critical to evaluate if the gains are reliable and attributable to the proposed fusion mechanism rather than training variations.
  2. Inference mechanism (as described in abstract): The confidence-gated fusion is presented qualitatively without quantitative analysis of activation frequency, false positive rate on non-ambiguous cases, or ablation removing the gate; this directly impacts whether the central claim that retrieval augmentation enhances ambiguous case handling holds.
  3. Method description: No ablation studies are described to separate the contributions of the joint training objectives, EMA teacher, and the inference-time fusion, making it difficult to confirm that the dual-path aspect is responsible for the observed improvements on the two datasets.
minor comments (2)
  1. Abstract: The improvement ranges are written as '0.68%-0.21%' which is non-standard ordering and unclear; it should be clarified if this is the range across backbones or something else.
  2. The paper would benefit from including the exact values of free parameters such as EMA decay rate and number of prototypes per class in the main text or appendix for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help us improve the clarity and rigor of the manuscript. We address each major comment point by point below, indicating the revisions we will incorporate.

Point-by-point responses
  1. Referee: Abstract: The reported performance improvements are given as ranges without specifying per-backbone results, statistical significance, or number of runs, which is critical to evaluate if the gains are reliable and attributable to the proposed fusion mechanism rather than training variations.

    Authors: We agree that the abstract summary could be more precise. The ranges (0.21-0.68% on HAM10000 and 0.44-2.69% on ISIC2019) are used for brevity to convey the consistent gains across backbones. Detailed per-backbone results are already provided in Tables 1 and 2 of the main text. In the revised manuscript, we will update the abstract to explicitly note that experiments were run with fixed random seeds for reproducibility and to reference the per-backbone values and any variance reported in the tables. This will allow readers to better assess reliability without lengthening the abstract excessively.
    Revision: yes

  2. Referee: Inference mechanism (as described in abstract): The confidence-gated fusion is presented qualitatively without quantitative analysis of activation frequency, false positive rate on non-ambiguous cases, or ablation removing the gate; this directly impacts whether the central claim that retrieval augmentation enhances ambiguous case handling holds.

    Authors: The abstract necessarily presents the gating strategy at a high level. The full manuscript includes qualitative visualizations and case studies showing improved handling of ambiguous examples. We acknowledge that quantitative support would strengthen the central claim. In the revision, we will add: (1) the percentage of test samples where the gate activates, (2) an analysis of false-positive activations (cases where the gate triggers but the classifier prediction was correct), and (3) an ablation comparing performance with the gate disabled. These additions will be placed in the experimental or analysis section.
    Revision: yes

  3. Referee: Method description: No ablation studies are described to separate the contributions of the joint training objectives, EMA teacher, and the inference-time fusion, making it difficult to confirm that the dual-path aspect is responsible for the observed improvements on the two datasets.

    Authors: The current manuscript emphasizes the integrated framework and its overall results. We agree that component-wise ablations would help isolate contributions and confirm the value of the dual-path design. In the revised version, we will add ablation experiments that separately evaluate: (i) cross-entropy only versus joint cross-entropy + supervised contrastive loss, (ii) prototype bank construction with versus without the EMA teacher, and (iii) inference with versus without the gated fusion. These will be reported on both datasets to directly address the concern.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with independent experimental validation

full rationale

The paper describes a plug-and-play empirical architecture (joint CE + supervised contrastive training, EMA teacher for prototype bank construction, and conservative confidence-gated fusion at inference) whose performance claims are presented as measured improvements on HAM10000 and ISIC2019 across backbones, supported by visualization. No mathematical derivation chain exists that reduces a claimed prediction or result to its own inputs by construction; there are no equations shown that equate fitted parameters to outputs, no self-definitional loops, and no load-bearing self-citations or uniqueness theorems invoked to force the method. The reported gains and ambiguity-handling claims rest on external dataset evaluation rather than tautological re-expression of training objectives.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The framework introduces several empirical design choices, including the dual-path structure, the EMA teacher, clustering for prototypes, and the specific conservative fusion rule. None of these is theoretically derived; they are instead validated through experiments on medical datasets.

free parameters (3)
  • EMA decay rate
    Hyperparameter for updating the teacher model with exponential moving average; value not provided in abstract
  • Number of prototypes per class
    Determined via clustering of teacher embeddings; affects the granularity of the memory bank
  • Gating thresholds
    Confidence and conflict thresholds for deciding when to fuse retrieval output; not specified
axioms (2)
  • Domain assumption: joint optimization of cross-entropy and supervised contrastive losses yields cosine-compatible embeddings
    Stated as the goal for reliable prototype matching
  • Domain assumption: clustering in teacher embedding space produces useful multi-prototypes for retrieval
    Core to building the memory bank



Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 1 internal anchor

  1. [1]

    International Journal of Intelligent Systems2025(1), 3164952 (2025)

    Alam, F., Ullah, A., Shah, D., Ali, S., Tahir, M.: Artificial intelligence in melanoma detection: a review of current technologies and future directions. International Journal of Intelligent Systems2025(1), 3164952 (2025)

  2. [2]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Aleem, S., Wang, F., Maniparambil, M., Arazo, E., Dietlmeier, J., Curran, K., Connor, N.E., Little, S.: Test-time adaptation with salip: A cascade of sam and clip for zero-shot medical image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5184–5193 (2024)

  3. [3]

    Sage Open5(4), 2158244015611451 (2015)

    Bresciani,S.,Eppler,M.J.:The pitfallsofvisualrepresentations: Areviewandclas- sification of common errors made while designing and interpreting visualizations. Sage Open5(4), 2158244015611451 (2015)

  4. [4]

    Annals of translational medicine8(11), 713 (2020)

    Cai, L., Gao, J., Zhao, D.: A review of the application of deep learning in medical image classification and segmentation. Annals of translational medicine8(11), 713 (2020)

  5. [5]

    IEEE Journal of Biomedical and Health Informatics (2025)

    Cao, L., Li, H., Dong, Y., Liu, T., Li, J.: Few-shot class-incremental learning with dynamic prototype refinement for brain activity classification. IEEE Journal of Biomedical and Health Informatics (2025)

  6. [6]

    Computers in biology and medicine185, 109507 (2025)

    Chen, C., Isa, N.A.M., Liu, X.: A review of convolutional neural network based methods for medical image classification. Computers in biology and medicine185, 109507 (2025)

  7. [7]

    In: International conference on medical image computing and computer-assisted intervention

    Chen, W., Wang, P., Ren, H., Sun, L., Li, Q., Yuan, Y., Li, X.: Medical image synthesisviafine-grainedimage-textalignmentandanatomy-pathologyprompting. In: International conference on medical image computing and computer-assisted intervention. pp. 240–250. Springer (2024)

  8. [8]

    Advances in neural information processing systems 35, 23049–23062 (2022)

    Chen, Z., Deng, Y., Wu, Y., Gu, Q., Li, Y.: Towards understanding the mixture-of- experts layer in deep learning. Advances in neural information processing systems 35, 23049–23062 (2022)

  9. [9]

    Medical Image Analysis76, 102313 (2022)

    Cheng, J., Tian, S., Yu, L., Gao, C., Kang, X., Ma, X., Wu, W., Liu, S., Lu, H.: Resganet: Residual group attention network for medical image classification and segmentation. Medical Image Analysis76, 102313 (2022)

  10. [10]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Cheng, P., Lin, L., Lyu, J., Huang, Y., Luo, W., Tang, X.: Prior: Prototype rep- resentation joint learning from medical images and reports. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 21361–21371 (2023)

  11. [11]

    The Lancet Digital Health4(5), e330–e339 (2022)

    Combalia, M., Codella, N., Rotemberg, V., Carrera, C., Dusza, S., Gutman, D., Helba, B., Kittler, H., Kurtansky, N.R., Liopyris, K., et al.: Validation of artificial intelligence prediction models for skin cancer diagnosis using dermoscopy images: the 2019 international skin imaging collaboration grand challenge. The Lancet Digital Health4(5), e330–e339 (2022)

  12. [12]

    In: International Conference on Machine Learning

    Conti, J.R., Noiry, N., Clemencon, S., Despiegel, V., Gentric, S.: Mitigating gender bias in face recognition using the von mises-fisher mixture model. In: International Conference on Machine Learning. pp. 4344–4369. PMLR (2022)

  13. [13]

    Cochrane Database of Systematic Reviews (12) (2018)

    Dinnes, J., Deeks, J.J., Chuchu, N., di Ruffano, L.F., Matin, R.N., Thomson, D.R., Wong, K.Y., Aldridge, R.B., Abbott, R., Fawzy, M., et al.: Dermoscopy, with and without visual inspection, for diagnosing melanoma in adults. Cochrane Database of Systematic Reviews (12) (2018)

  14. [14]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) 16 Z. Tang et al

  15. [15]

    Advances in Neural Information Processing Systems34, 30284–30297 (2021)

    Englesson, E., Azizpour, H.: Generalized jensen-shannon divergence loss for learn- ing with noisy labels. Advances in Neural Information Processing Systems34, 30284–30297 (2021)

  16. [16]

    Ad- vances in neural information processing systems30(2017)

    Geifman, Y., El-Yaniv, R.: Selective classification for deep neural networks. Ad- vances in neural information processing systems30(2017)

  17. [17]

    Advances in Neural Information Processing Systems37, 111047–111073 (2024)

    Goren, S., Galil, I., El-Yaniv, R.: Hierarchical selective classification. Advances in Neural Information Processing Systems37, 111047–111073 (2024)

  18. [18]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Han, Z., Yang, F., Huang, J., Zhang, C., Yao, J.: Multimodal dynamics: Dynamical fusion for trustworthy multimodal classification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20707–20717 (2022)

  19. [19]

    PET clinics 17(1), 1 (2022)

    Hasani, N., Morris, M.A., Rhamim, A., Summers, R.M., Jones, E., Siegel, E., Saboury, B.: Trustworthy artificial intelligence in medical imaging. PET clinics 17(1), 1 (2022)

  20. [20]

    von mises-fisher mixture model-based deep learning: Application to face verification,

    Hasnat, M.A., Bohné, J., Milgram, J., Gentric, S., Chen, L.: von mises-fisher mix- ture model-based deep learning: Application to face verification. arXiv preprint arXiv:1706.04264 (2017)

  21. [21]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9729–9738 (2020)

  22. [22]

    He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)

  23. [23]

    In: Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

    Hu, P., Qin, Y., Gou, Y., Li, Y., Yang, M., Peng, X.: Probabilistic multimodal learning with von mises-fisher distributions. In: Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence. pp. 5390–5398 (2025)

  24. [24]

    Hu, X., Zeng, D., Xu, X., Shi, Y.: Semi-supervised contrastive learning for label- efficientmedicalimagesegmentation.In:Internationalconferenceonmedicalimage computing and computer-assisted intervention. pp. 481–490. Springer (2021)

  25. [25]

    IEEE Access (2025)

    Hussain, T., Shouno, H., Hussain, A., Hussain, D., Ismail, M., Mir, T.H., Hsu, F.R., Alam, T., Akhy, S.A.: Effresnet-vit: A fusion-based convolutional and vision transformer model for explainable medical image classification. IEEE Access (2025)

  26. [26]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Huy, T.D., Tran, S.K., Nguyen, P., Tran, N.H., Sam, T.B., Van Den Hengel, A., Liao, Z., Verjans, J.W., To, M.S., Phan, V.M.H.: Interactive medical image analysis with concept-based similarity reasoning. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 30797–30806 (2025)

  27. [27]

    Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? Advances in neural information processing systems30(2017)

  28. [28]

    IEEE Access (2025)

    Khan, A., Rauf, Z., Khan, A.R., Rathore, S., Khan, S.H., Shah, N., Farooq, U., Asif, H., Asif, A., Zahoora, U., et al.: A recent survey of vision transformers for medical image segmentation. IEEE Access (2025)

  29. [29]

    Advances in neural information processing systems33, 18661–18673 (2020)

    Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., Krishnan, D.: Supervised contrastive learning. Advances in neural information processing systems33, 18661–18673 (2020)

  30. [30]

    BMC medical imaging22(1), 69 (2022)

    Kim, H.E., Cosa-Linan, A., Santhanam, N., Jannesari, M., Maros, M.E., Gans- landt, T.: Transfer learning for medical image classification: a literature review. BMC medical imaging22(1), 69 (2022)

  31. [31]

    The lancet oncology3(3), 159–165 (2002)

    Kittler, H., Pehamberger, H., Wolff, K., Binder, M.: Diagnostic accuracy of der- moscopy. The lancet oncology3(3), 159–165 (2002)

  32. [32]

    Multimedia Tools and Applications83(7), 19683– 19728 (2024) T-DuMpRa 17

    Kumar, R., Kumbharkar, P., Vanam, S., Sharma, S.: Medical images classification using deep learning: a survey. Multimedia Tools and Applications83(7), 19683– 19728 (2024) T-DuMpRa 17

  33. [33]

    In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

    Li, T., Cao, P., Yuan, Y., Fan, L., Yang, Y., Feris, R.S., Indyk, P., Katabi, D.: Targeted supervised contrastive learning for long-tailed recognition. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6918–6928 (2022)

  34. [34]

    IEEE Transactions on Neural Networks and Learning Systems (2025)

    Li, W., Peng, Y., Zhang, M., Ding, L., Hu, H., Shen, L.: Deep model fusion: A survey. IEEE Transactions on Neural Networks and Learning Systems (2025)

  35. [35]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

    Li, X., Li, J., Du, Z., Zhu, L., Shen, H.T.: Unified modality separation: A vision- language framework for unsupervised domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  36. [36]

    IEEE Journal of Biomedical and Health Informatics29(5), 3587–3597 (2025)

    Liang, X., Li, X., Li, F., Jiang, J., Dong, Q., Wang, W., Wang, K., Dong, S., Luo, G., Li, S.: Medfilip: Medical fine-grained language-image pre-training. IEEE Journal of Biomedical and Health Informatics29(5), 3587–3597 (2025)

  37. [37]

    In: Proceedings of the 33rd ACM International Conference on Multimedia

    Liang,Y.,Chen,H.,Xiong,Y.,Zhou,Z.,Lyu,M.,Lin,Z.,Niu,S.,Zhao,S.,Han,J., Ding, G.: Advancing reliable test-time adaptation of vision-language models under visual variations. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 4788–4797 (2025)

  38. [38]

    IEEE Transactions on Medical Imaging43(2), 674–685 (2023)

    Ling, Y., Wang, Y., Dai, W., Yu, J., Liang, P., Kong, D.: Mtanet: Multi-task attention network for automatic medical image segmentation and classification. IEEE Transactions on Medical Imaging43(2), 674–685 (2023)

  39. [39]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition

    Liu, F., Tian, Y., Chen, Y., Liu, Y., Belagiannis, V., Carneiro, G.: Acpl: Anti- curriculum pseudo-labelling for semi-supervised medical image classification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 20697–20706 (2022)

  40. [40]

    Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer:Hierarchicalvisiontransformerusingshiftedwindows.In:Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)

  41. [41]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11976–11986 (2022)

  42. [42]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Long, A., Yin, W., Ajanthan, T., Nguyen, V., Purkait, P., Garg, R., Blair, A., Shen, C., Van den Hengel, A.: Retrieval augmented classification for long-tail visual recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6959–6969 (2022)

  43. [43]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Manhardt, F., Arroyo, D.M., Rupprecht, C., Busam, B., Birdal, T., Navab, N., Tombari, F.: Explaining the ambiguity of object detection and 6d pose from visual data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6841–6850 (2019)

  44. [44]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Meng, M., Feng, D., Bi, L., Kim, J.: Correlation-aware coarse-to-fine mlps for de- formable medical image registration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9645–9654 (2024)

  45. [45]

    In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference

    Mildenberger, D., Hager, P., Rueckert, D., Menten, M.J.: A tale of two classes: adapting supervised contrastive learning to binary imbalanced datasets. In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference. pp. 10305– 10314 (2025)

  46. [46]

    Advances in neural information processing systems 34, 14200–14213 (2021)

    Nagrani, A., Yang, S., Arnab, A., Jansen, A., Schmid, C., Sun, C.: Attention bot- tlenecks for multimodal fusion. Advances in neural information processing systems 34, 14200–14213 (2021)

  47. [47]

    Tang et al

    Nguyen, T.T.D., Rezatofighi, H., Vo, B.N., Vo, B.T., Savarese, S., Reid, I.: How trustworthy are performance evaluations for basic vision tasks? IEEE Transactions on Pattern Analysis and Machine Intelligence45(7), 8538–8552 (2022) 18 Z. Tang et al

  48. [48]

    ACM Computing Surveys56(4), 1–41 (2023)

    Patrício,C.,Neves,J.C.,Teixeira,L.F.:Explainabledeeplearningmethodsinmed- ical image classification: A survey. ACM Computing Surveys56(4), 1–41 (2023)

  49. [49]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Pellicer, A.L., Mariucci, A., Angelov, P., Bukhari, M., Kerns, J.G.: Protomedx: Towards explainable multi-modal prototype learning for bone health classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7357–7366 (2025)

  50. [50]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Rao, B., Liao, H., Guan, Y., Wang, C., Wang, B., Zhang, J., Li, Z.: Amd: Adap- tive momentum and decoupled contrastive learning framework for robust long-tail trajectory prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 28849–28858 (2025)

  51. [51]

    In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

    Sacha, M., Rymarczyk, D., Struski, Ł., Tabor, J., Zieliński, B.: Protoseg: Inter- pretable semantic segmentation with prototypical parts. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1481– 1492 (2023)

  52. [52]

    IEEE Signal Processing Letters31, 1109–1113 (2024)

    Shao, R., Bi, X.J., Chen, Z.: Hybrid vit-cnn network for fine-grained image classi- fication. IEEE Signal Processing Letters31, 1109–1113 (2024)

  53. [53]

    In: Interna- tional conference on medical image computing and computer-assisted intervention

    Sharma,S.,Kumar,A.,Chandra,J.:Confidencematters:Enhancingmedicalimage classification through uncertainty-driven contrastive self-distillation. In: Interna- tional conference on medical image computing and computer-assisted intervention. pp. 133–142. Springer (2024)

  54. [54]

    Ad- vances in neural information processing systems30(2017)

    Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. Ad- vances in neural information processing systems30(2017)

  55. [55]

    Song, W., Chen, D.: Posture-guided part learning for fine-grained image categorization. Journal of Electronic Imaging 33(3), 033013–033013 (2024)

  56. [56]

    Spolaor, N., Lee, H.D., Mendes, A.I., Nogueira, C.V., Parmezan, A.R.S., Takaki, W.S.R., Coy, C.S.R., Wu, F.C., Fonseca-Pinto, R.: Fine-tuning pre-trained neural networks for medical image classification in small clinical datasets. Multimedia Tools and Applications 83(9), 27305–27329 (2024)

  57. [57]

    Sutter, T., Daunhawer, I., Vogt, J.: Multimodal generative learning utilizing jensen-shannon-divergence. Advances in neural information processing systems 33, 6100–6110 (2020)

  58. [58]

    Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning. pp. 6105–6114. PMLR (2019)

  59. [59]

    Tang, Z., Sun, B., He, S., Hong, Y., Yu, D., Liu, Z., Li, M., Chen, B., Zhao, S.: Mibf-net: Multi-modal information balanced fusion network for clinical diagnosis via patient narratives and lesion image. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 366–375. Springer (2025)

  60. [60]

    Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems 30 (2017)

  61. [61]

    Tsuneki, M.: Deep learning models in medical image analysis. Journal of Oral Biosciences 64(3), 312–320 (2022)

  62. [62]

    Valmadre, J.: Hierarchical classification at multiple operating points. Advances in Neural Information Processing Systems 35, 18034–18045 (2022)

  63. [63]

    Van der Velden, B.H., Kuijf, H.J., Gilhuijs, K.G., Viergever, M.A.: Explainable artificial intelligence (xai) in deep learning-based medical image analysis. Medical image analysis 79, 102470 (2022)

  64. [64]

    Wang, K., Zhan, B., Zu, C., Wu, X., Zhou, J., Zhou, L., Wang, Y.: Tripled-uncertainty guided mean teacher model for semi-supervised medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 450–460. Springer (2021)

  65. [65]

    Wen, D., Khan, S.M., Xu, A.J., Ibrahim, H., Smith, L., Caballero, J., Zepeda, L., de Blas Perez, C., Denniston, A.K., Liu, X., et al.: Characteristics of publicly available skin cancer image datasets: a systematic review. The Lancet Digital Health 4(1), e64–e74 (2022)

  66. [66]

    Wen, Z., Li, Y.: Toward understanding the feature learning process of self-supervised contrastive learning. In: International Conference on Machine Learning. pp. 11112–11122. PMLR (2021)

  67. [67]

    Xu, Y., Wang, D., Zhang, L., Zhang, L.: Dual selective fusion transformer network for hyperspectral image classification. Neural Networks 187, 107311 (2025)

  68. [68]

    Yang, M., Zhou, Z., Gong, W.: Revisiting the representation learning in long-tailed medical image classification. Pattern Recognition p. 112683 (2025)

  69. [69]

    Zadeh, S.G., Schmid, M.: Bias in cross-entropy-based training of deep survival networks. IEEE transactions on pattern analysis and machine intelligence 43(9), 3126–3137 (2020)

  70. [70]

    Zhao, L., Chen, X., Chen, E.Z., Liu, Y., Chen, T., Sun, S.: Retrieval-augmented few-shot medical image segmentation with foundation models. IEEE Transactions on Neural Networks and Learning Systems (2025)

  71. [71]

    Zhou, Y., Lei, T., Liu, H., Du, N., Huang, Y., Zhao, V., Dai, A.M., Le, Q.V., Laudon, J., et al.: Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems 35, 7103–7114 (2022)

  72. [72]

    Zhu, Y., Wang, S., Yu, H., Li, W., Tian, J.: Sfpl: Sample-specific fine-grained prototype learning for imbalanced medical image classification. Medical Image Analysis 97, 103281 (2024)

  73. [73]

    Zhu, Z., Yu, K., Qi, G., Cong, B., Li, Y., Li, Z., Gao, X.: Lightweight medical image segmentation network with multi-scale feature-guided fusion. Computers in Biology and Medicine 182, 109204 (2024)

A Effectiveness Analysis of Confidence-Gated Prototype Retrieval

This appendix provides a theoretical justification for the proposed confide...