
arxiv: 2605.18491 · v1 · pith:EEI5EUCRnew · submitted 2026-05-18 · 💻 cs.CV

Benchmarking transferability of SSL pretraining to same and different modality segmentation tasks

Pith reviewed 2026-05-20 11:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learning · medical image segmentation · transfer learning · masked image modeling · self-distillation · CT · MRI · few-shot learning

The pith

Self-distilled masked image modeling with local and global distillation achieves best transfer to medical segmentation tasks across modalities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates nine self-supervised learning methods by pretraining them on over 10,000 CT scans and then fine-tuning the encoders for segmentation on nine different tasks involving CT and MRI images. The results show that a method called SMIT, which combines masked image modeling with self-distillation, delivers the highest accuracy, converges fastest during fine-tuning, and maintains strong performance even with limited labeled data. This indicates that for medical imaging where annotations are expensive, certain combinations of pretext tasks produce more transferable features than standard contrastive or predictive approaches. The study also finds that method differences are most pronounced in low-data regimes and that feature reuse patterns are more consistent for the top method.

Core claim

The central claim is that self-distilled masked image transformer (SMIT), which combines masked image modeling (MIM) with local and global self-distillation, achieves the highest overall segmentation accuracy across the nine tasks, the fastest fine-tuning convergence, and the smallest few-shot-to-many-shot performance gap, indicating the strongest data efficiency. SMIT also showed the most consistent feature-reuse patterns between few- and many-shot fine-tuning. MIM-based SimMIM and self-distillation methods (DINO, iBOT) outperformed contrastive learning and rotation prediction, which rely on image-level global representations. Differences between SSL methods were largest in the few-shot setting and narrowed as the size of the labeled fine-tuning dataset increased.

What carries the argument

Self-distilled masked image transformer (SMIT) that integrates masked image modeling with local and global self-distillation, serving as the encoder in a SwinUNETR-style segmentation network.

If this is right

  • MIM-based SimMIM and self-distillation methods outperform contrastive learning and rotation prediction in transfer to segmentation tasks.
  • Performance gaps between SSL methods are largest in few-shot settings and narrow as the size of the labeled fine-tuning dataset increases.
  • SMIT exhibits the most consistent feature-reuse patterns between few-shot and many-shot fine-tuning.
  • The choice of SSL pretraining matters most under limited annotation budgets for medical segmentation.
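
The few-shot-to-many-shot gap referred to in the points above can be made concrete with a small sketch. The definition used here (relative drop from many-shot to few-shot DSC, in percent) is an assumption for illustration; the paper reports a "performance gap (%)" but this page does not state its exact formula. The function name and the numbers are hypothetical.

```python
def few_to_many_gap_percent(few_shot_dsc: float, many_shot_dsc: float) -> float:
    """Relative drop (in %) from many-shot DSC to few-shot DSC.

    Assumed definition, not confirmed by the paper:
        gap = 100 * (many - few) / many
    A smaller gap means the few-shot model is closer to the many-shot one.
    """
    if many_shot_dsc <= 0:
        raise ValueError("many_shot_dsc must be positive")
    return 100.0 * (many_shot_dsc - few_shot_dsc) / many_shot_dsc


# Hypothetical DSC values for illustration only (not results from the paper).
gap = few_to_many_gap_percent(few_shot_dsc=0.60, many_shot_dsc=0.80)
print(f"few-to-many-shot gap: {gap:.1f}%")
```

Under this definition, ranking methods by gap rewards encoders whose few-shot accuracy sits close to their own many-shot accuracy, independent of the absolute DSC level.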

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pretraining CT dataset's coverage of disease sites and anatomical variation does not fully overlap with the downstream tasks, part of SMIT's measured edge may trace to dataset similarity instead of the pretext-task design.
  • The finding that hybrid MIM-plus-distillation yields stronger data efficiency points toward testing whether the same pattern holds when the decoder is also transformer-based rather than a 3D CNN.
  • Benchmarking results like these could guide selection of initialization strategies in clinical pipelines where annotation budgets are fixed and cross-modality transfer is required.

Load-bearing premise

The 10,412 CT scans used for pretraining are representative enough of the anatomical and pathological variability present in the nine downstream segmentation tasks, including the MRI modality transfers, so that observed performance differences can be attributed primarily to the choice of SSL pretext task rather than dataset mismatch.

What would settle it

Retraining the nine SSL methods on a pretraining set that includes substantial MRI scans and then re-evaluating whether SMIT still shows the largest advantage on the MRI segmentation tasks would test if the reported superiority holds when modality distribution is balanced.

Figures

Figures reproduced from arXiv: 2605.18491 by Harini Veeraraghavan, Jue Jiang.

Figure 1
Figure 1: (a) and (b) illustrate the SSL pretraining methods and downstream tasks. (c) summarizes the analyses conducted in this paper.
Figure 2
Figure 2: Impact of SSL pretraining methods applied to downstream segmentation tasks involving CT and MRI using the SwinUNETR-style segmentation network (Swin Transformer encoder with CNN decoder). SSL methods using MIM, including SMIT and SimMIM, outperformed all other methods in both modalities. SMIT was the most accurate, with an average accuracy of 0.80 for CT and 0.79 for MRI.
Figure 3
Figure 3: Grouped accuracies for small organs (left adrenal, right adrenal, gall bladder), large organs (liver, left kidney, right kidney, spleen), gastrointestinal (GI) organs (stomach, duodenum, pancreas, esophagus), and lung tumor using SSL methods for CT (a, b) and MRI (c, d).
Figure 4
Figure 4: DSC accuracy difference between models for the same structures applied to MRI and CT. Significance test results are indicated as *: p < 0.05; **: p < 0.01; ***: p < 0.001.
Figure 5
Figure 6
Figure 6: Training loss and validation accuracy curves for analyzed pretrained models applied to segmenting (a) abdomen organs from CT, (b) abdomen organs from MRI, (c) liver tumor from CT, (d) kidney tumor from CT, (e) lung tumor from CT, and (f) lung tumor from MRI.
Figure 7
Figure 7: Few-shot and many-shot accuracies with performance gap (%) shown for segmenting (a) abdomen organs from CT, (b) abdomen organs from MRI, (c) lung tumors from CT, (d) kidney tumors from CT, (e) liver tumors from CT, and (f) lung tumors from MRI.
Figure 8
Figure 8: Feature reuse analysis performed using CKA comparing pretrained versus finetuned models for 5-, 10-, and many-shot regimes applied to segmenting (a) lung tumors from CT, (b) abdomen organs from CT, (c) lung tumors from MRI, and (d) abdomen organs from MRI.
Figure 9
Figure 9: (a) Impact of pretraining data size on SMIT model segmentation accuracy for organs segmentation from CT, lung tumor segmentation from CT and MRI. (b) Impact of model size/capacity on multi-organ segmentation accuracy from CT.

Original abstract

Methods: Nine SSL methods spanning four pretext-task families were pretrained from scratch using the same 10,412 3D CT scans (1.89 M 2D axial slices) covering varied disease sites. The pretrained Swin Transformer encoder from each method was integrated into a SwinUNETR-style segmentation network (Swin encoder with a 3D CNN decoder and skip connections) and fine-tuned on nine public segmentation tasks of varying complexity, including large abdominal organs, head-and-neck structures, and tumors from CT and MRI. Performance was assessed using Dice similarity coefficient (DSC). Fine-tuning convergence speed, transferability across modalities (CT-to-MRI), and feature-reuse patterns between few- and many-shot fine-tuning were further analyzed using centered kernel alignment.

Results: Self-distilled masked image transformer (SMIT), which combines masked image modeling (MIM) with local and global self-distillation, achieved the highest overall segmentation accuracy across the nine tasks, the fastest fine-tuning convergence, and the smallest few-shot-to-many-shot performance gap, indicating the strongest data efficiency. SMIT also showed the most consistent feature-reuse patterns between few- and many-shot fine-tuning. MIM-based SimMIM and self-distillation methods (DINO, iBOT) outperformed contrastive learning and rotation prediction, which rely on image-level global representations. Differences between SSL methods were largest in the few-shot setting and narrowed as the size of the labeled fine-tuning dataset increased, indicating that the choice of SSL pretraining matters most under limited annotation budgets.
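
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of the Dice similarity coefficient for a single binary structure, DSC = 2|P ∩ T| / (|P| + |T|). It operates on flattened masks given as plain Python sequences; the function name, the empty-mask convention, and the toy masks are assumptions for illustration and are not taken from the paper's evaluation code.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice similarity coefficient for two flattened binary masks.

    DSC = 2 * |P ∩ T| / (|P| + |T|), where non-zero entries mark foreground.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (an assumed convention).
        return 1.0
    return 2.0 * intersection / total


# Toy flattened masks (hypothetical): 2 overlapping voxels, 3 foreground each.
pred_mask = [1, 1, 0, 0, 1]
true_mask = [1, 0, 0, 1, 1]
print(f"DSC = {dice_coefficient(pred_mask, true_mask):.3f}")
```

For multi-organ tasks, a common approach is to compute this per structure and average, though the aggregation used in the paper is not specified on this page.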

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript benchmarks transferability of nine SSL pretraining methods spanning four pretext-task families. All methods are pretrained from scratch on the identical set of 10,412 3D CT scans (1.89 M axial slices) using a Swin Transformer encoder; the resulting encoders are inserted into a SwinUNETR-style segmentation network and fine-tuned on nine public CT and MRI segmentation tasks. Performance is measured by Dice similarity coefficient (DSC), with additional analyses of fine-tuning convergence speed, CT-to-MRI transfer, and feature reuse via centered kernel alignment (CKA). The central claim is that SMIT (masked image modeling combined with local and global self-distillation) yields the highest overall DSC, fastest convergence, smallest few-shot-to-many-shot gap, and most consistent feature reuse, while MIM-based and self-distillation methods generally outperform contrastive and rotation-prediction approaches, with larger gaps in the few-shot regime.

Significance. If the ranking holds under proper statistical controls, the work supplies a cleanly controlled empirical map of how different SSL pretext families transfer to same- and cross-modality medical segmentation. The uniform pretraining corpus and architecture isolate pretext-task effects, which is a genuine strength for attributing relative performance differences. The emphasis on few-shot regimes and data-efficiency metrics is practically relevant for annotation-scarce medical imaging settings.

major comments (2)
  1. [Methods] Methods section (experimental protocol): the description of the nine downstream tasks does not report exact train/validation/test splits, hyperparameter search ranges or budgets, or any statistical testing (e.g., paired tests or bootstrap confidence intervals) for the reported DSC rankings. Without these, the claim that SMIT is strictly highest overall and exhibits the smallest few-to-many-shot gap rests only on point estimates and cannot be considered robust.
  2. [Results] Results section (Tables/Figures reporting per-task and aggregate DSC): the manuscript presents SMIT as achieving the highest overall accuracy and most consistent CKA reuse, yet provides no quantitative assessment of whether the observed differences across the nine methods are statistically significant or could arise from task-specific variance. This directly affects the load-bearing conclusion that SMIT offers the strongest data efficiency.
minor comments (2)
  1. [Abstract] Abstract and §3: the phrase '1.89 M 2D axial slices' should clarify whether this count is exact or rounded and whether any slices were excluded during preprocessing.
  2. [§4.3] Figure captions and §4.3: the CKA heatmaps would benefit from explicit labeling of which layers correspond to the reported 'most consistent feature-reuse patterns' for SMIT versus baselines.
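
The CKA-based feature-reuse comparison referenced in the comments above can be sketched as follows. This is linear CKA computed from two activation matrices (rows are the same inputs, columns are features), which is one common variant; whether the paper uses the linear or a kernel form, and which layers it compares, is not stated on this page. All names and the toy activations are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def _center_columns(m: Matrix) -> Matrix:
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def _cross_frobenius_sq(a: Matrix, b: Matrix) -> float:
    """Squared Frobenius norm of a^T b for matrices sharing the same rows."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), after centering."""
    if len(x) != len(y):
        raise ValueError("x and y must contain activations for the same inputs")
    xc, yc = _center_columns(x), _center_columns(y)
    numerator = _cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(_cross_frobenius_sq(xc, xc) * _cross_frobenius_sq(yc, yc))
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical activations of one layer before and after fine-tuning
# (3 inputs, 2 features each); values are illustrative only.
pretrained = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
finetuned = [[1.1, 2.3], [2.9, 4.2], [5.2, 6.6]]
print(f"layer CKA = {linear_cka(pretrained, finetuned):.3f}")
```

A value near 1 for a layer would indicate that fine-tuning left its representation largely intact (high feature reuse), while lower values indicate that the layer was substantially rewritten. Reporting this per layer is what the second minor comment asks the figure labels to make explicit.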

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental rigor that will improve the clarity and robustness of our results. We address each major comment below and will revise the manuscript to incorporate the suggested details.

  1. Referee: [Methods] Methods section (experimental protocol): the description of the nine downstream tasks does not report exact train/validation/test splits, hyperparameter search ranges or budgets, or any statistical testing (e.g., paired tests or bootstrap confidence intervals) for the reported DSC rankings. Without these, the claim that SMIT is strictly highest overall and exhibits the smallest few-to-many-shot gap rests only on point estimates and cannot be considered robust.

    Authors: We agree that explicit reporting of these details is necessary for full reproducibility and to support the robustness of our claims. The nine public downstream tasks follow the official train/validation/test splits provided by each dataset repository or original publication; we will add a dedicated table or subsection listing these splits for each task. Hyperparameter selection for fine-tuning was performed via grid search over standard ranges (learning rate, batch size, number of epochs, and optimizer settings) drawn from prior medical segmentation literature, with the final chosen values and search budget documented in the revised Methods. We will also add statistical testing, including bootstrap confidence intervals on DSC scores and paired Wilcoxon signed-rank tests across methods, to evaluate whether SMIT's advantages are statistically significant. These revisions will be included in the updated manuscript. revision: yes

  2. Referee: [Results] Results section (Tables/Figures reporting per-task and aggregate DSC): the manuscript presents SMIT as achieving the highest overall accuracy and most consistent CKA reuse, yet provides no quantitative assessment of whether the observed differences across the nine methods are statistically significant or could arise from task-specific variance. This directly affects the load-bearing conclusion that SMIT offers the strongest data efficiency.

    Authors: We acknowledge that the current results rely on point estimates without formal statistical quantification of differences. While the consistent ranking of SMIT across tasks and regimes (particularly the reduced few-to-many-shot gap) supports our conclusions, we agree that adding quantitative assessment of significance will strengthen the evidence. In the revision we will report bootstrap-derived confidence intervals for aggregate and per-task DSC values, along with p-values from appropriate non-parametric tests (e.g., Wilcoxon rank-sum) comparing SMIT against other methods. This will allow readers to distinguish reliable differences from task-specific variance. The core empirical findings remain unchanged, but the presentation will be updated to include these analyses. revision: yes
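
The bootstrap confidence intervals proposed in the responses above could look roughly like the sketch below: a paired percentile bootstrap over test cases for the mean DSC difference between two methods. This is a standard-library illustration under stated assumptions, not the authors' analysis; the function name, resample count, and DSC values are hypothetical. The Wilcoxon signed-rank test mentioned in the rebuttal is not implemented here; in practice `scipy.stats.wilcoxon` is a common choice for it.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    dsc_a: Sequence[float],
    dsc_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for mean(dsc_a - dsc_b) over paired test cases.

    Returns (observed mean difference, lower bound, upper bound).
    """
    if len(dsc_a) != len(dsc_b) or len(dsc_a) == 0:
        raise ValueError("need two non-empty, equally sized score lists")
    diffs = [a - b for a, b in zip(dsc_a, dsc_b)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(diffs)
    # Resample cases with replacement and record the mean difference each time.
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = boot_means[int((alpha / 2) * n_resamples)]
    upper = boot_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Hypothetical per-case DSC values for two methods on the same six test cases.
method_a = [0.82, 0.79, 0.85, 0.80, 0.77, 0.83]
method_b = [0.78, 0.76, 0.80, 0.79, 0.75, 0.80]
observed, low, high = paired_bootstrap_ci(method_a, method_b)
print(f"mean DSC difference = {observed:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling whole cases rather than individual scores keeps the pairing between methods intact, which matters because both methods are evaluated on the same scans. An interval that excludes zero would support the claim that one method is reliably better on that task.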

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmarking


The paper performs controlled empirical comparisons of nine SSL pretext tasks, all pretrained from scratch on the identical 10,412 CT scans with the same Swin Transformer backbone before fine-tuning on nine separate public segmentation datasets. Performance metrics (DSC, convergence speed, CKA feature reuse) are measured directly on held-out downstream tasks rather than derived from any equations or fitted parameters internal to the study. No derivation chain, self-definitional relations, or load-bearing self-citations that reduce claims to inputs are present; relative differences are isolated by the uniform pretraining setup.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmarking study whose central claim rests on measured performance differences rather than new theoretical constructs. The main background assumptions concern the suitability of the Swin Transformer for medical volumes and the validity of Dice as a segmentation metric.

axioms (1)
  • domain assumption: The Swin Transformer encoder pretrained via SSL can be directly integrated into a SwinUNETR-style segmentation network with a 3D CNN decoder and skip connections.
    This architectural choice is treated as standard and is not derived or justified within the reported experiments.



Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 1 internal anchor

  1. [1]

    Toward foundational deep learning models for medical imaging in the new era of transformer networks

    M.J. Willemink, R.R Roth, and V Sandfort. Toward foundational deep learning models for medical imaging in the new era of transformer networks. Radiol Artif Intell, 4(6), 2022.

  2. [2]

    Self-supervised learning for medical image analysis: Discriminative, restorative, or adversarial? Medical Image Analysis, 94:103086, 2024

    Fatemeh Haghighi, Mohammad Reza Hosseinzadeh Taher, Michael B. Gotway, and Jianming Liang. Self-supervised learning for medical image analysis: Discriminative, restorative, or adversarial? Medical Image Analysis, 94:103086, 2024. ISSN 1361-8415. doi: https://doi.org/10.1016/j.media.2024.103086

  3. [3]

    Tuan Truong, Sadegh Mohammadi, and Matthias Lenga. How transferable are self-supervised features in medical image classification tasks? In Subhrajit Roy, Stephen Pfohl, Emma Rocheteau, Girmaw Abebe Tadesse, Luis Oala, Fabian Falck, Yuyin Zhou, Liyue Shen, Ghada Zamzmi, Purity Mugambi, Ayah Zirikly, Matthew B. A. McDermott, and Emily Alsentzer, editors, P...

  4. [4]

    How well do self-supervised models transfer?

    Linus Ericsson, Henry Gouk, and Timothy M. Hospedales. How well do self-supervised models transfer? In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5410–5419, 2021

  5. [5]

    Self-supervised pretraining improves self-supervised pretraining

    Colorado J Reed, Xiangyu Yue, Ani Nrusimha, Sayna Ebrahimi, Vivek Vijaykumar, Richard Mao, Bo Li, Shanghang Zhang, Devin Guillory, Sean Metzger, Kurt Keutzer, and Trevor Darrell. Self-supervised pretraining improves self-supervised pretraining. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2584–2594, Jan...

  6. [6]

    Swin transformers are robust to distribution and concept drift in endoscopy-based longitudinal rectal cancer assessment

    Jorge Tapias Gomez, Aneesh Rangnekar, Hannah Williams, Hannah Thompson, Julio Garcia-Aguilar, Joshua Jesse Smith, and Harini Veeraraghavan. Swin transformers are robust to distribution and concept drift in endoscopy-based longitudinal rectal cancer assessment. In Proc. SPIE 13406, Medical Imaging 2025: Image Processing, 134061N, 2025

  7. [7]

    Self-supervised pretraining of visual features in the wild. CoRR, abs/2103.01988, 2021

    Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, and Piotr Bojanowski. Self-supervised pretraining of visual features in the wild. CoRR, abs/2103.01988, 2021. URL https://arxiv.org/abs/2103.01988

  8. [8]

    What makes transfer learning work for medical images: Feature reuse & other factors

    Christos Matsoukas, Johan Fredin Haslum, Moein Sorkhei, Magnus Söderberg, and Kevin Smith. What makes transfer learning work for medical images: Feature reuse & other factors. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9215–9224, 2022

  9. [9]

    Self-supervised pretraining for 2d medical image segmentation

    András Kalapos and Bálint Gyires-Tóth. Self-supervised pretraining for 2d medical image segmentation. In Leonid Karlinsky, Tomer Michaeli, and Ko Nishino, editors, Computer Vision – ECCV 2022 Workshops, pages 472–484, Cham, 2023. Springer Nature Switzerland

  10. [10]

    Contrastive learning with continuous proxy meta-data for 3d MRI classification

    B Dufumier, P Gori, J Victor, A Grigis, M Wessa, P Brambilla, P Favre, M Polosan, C McDonald, C.M Piguet, M.L Phillips, L Eyler, and E Duchesnay. Contrastive learning with continuous proxy meta-data for 3d MRI classification. In Med Image Comput Computed Assisted Interv, volume 12902, pages 58–68. Springer, 2021

  11. [11]

    Dive into the details of self-supervised learning for medical image analysis. Medical Image Analysis, 89:102879, 2023

    Chuyan Zhang, Hao Zheng, and Yun Gu. Dive into the details of self-supervised learning for medical image analysis. Medical Image Analysis, 89:102879, 2023

  12. [12]

    Models genesis. Medical Image Analysis, 67:101840, 2021

    Zongwei Zhou, Vatsal Sodha, Jiaxuan Pang, Michael B Gotway, and Jianming Liang. Models genesis. Medical Image Analysis, 67:101840, 2021

  13. [13]

    3D self-supervised methods for medical imaging. Advances in Neural Information Processing Systems, 33:18158–18172, 2020

    Aiham Taleb, Winfried Loetzsch, Noel Danz, Julius Severin, Thomas Gaertner, Benjamin Bergner, and Christoph Lippert. 3D self-supervised methods for medical imaging. Advances in Neural Information Processing Systems, 33:18158–18172, 2020

  14. [14]

    Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images

    Ali Hatamizadeh, Vishwesh Nath, Yucheng Tang, Dong Yang, Holger R. Roth, and Daguang Xu. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In Alessandro Crimi and Spyridon Bakas, editors, Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, pages 272–284, Cham, 2022. Springer International Publishing

  15. [15]

    Self-supervised 3d anatomy segmentation using self-distilled masked image transformer (smit)

    Jue Jiang, Neelam Tyagi, Kathryn Tringale, Christopher Crane, and Harini Veeraraghavan. Self-supervised 3d anatomy segmentation using self-distilled masked image transformer (smit). In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 556–566. Springer, 2022

  16. [16]

    Self-supervised learning improves robustness of deep learning lung tumor segmentation models to ct imaging differences. Medical Physics, 52(3):1573–1588, 2025

    Jue Jiang, Aneesh Rangnekar, and Harini Veeraraghavan. Self-supervised learning improves robustness of deep learning lung tumor segmentation models to ct imaging differences. Medical Physics, 52(3):1573–1588, 2025

  17. [17]

    Auto-segmentation of neck nodal metastases using self-distilled masked image transformer on longitudinal mr images. BJR Artif Intell, 1(1), 2024

    R Paudyal, J Jiang, J Han, B.H Diplas, N Riaz, V Hatzoglou, N Lee, J Deasy, H Veeraraghavan, and A Dave. Auto-segmentation of neck nodal metastases using self-distilled masked image transformer on longitudinal mr images. BJR Artif Intell, 1(1), 2024

  18. [18]

    Benchmarking transferability of self-supervised pretraining for multi-organ segmentation on different modalities

    Jue Jiang and Harini Veeraraghavan. Benchmarking transferability of self-supervised pretraining for multi-organ segmentation on different modalities. In 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI), pages 1–5, 2025. doi: 10.1109/ISBI60581.2025.10980778

  19. [19]

    Self-supervised pretraining in the wild imparts image acquisition robustness to medical image transformers: an application to lung cancer segmentation

    Jue Jiang and Harini Veeraraghavan. Self-supervised pretraining in the wild imparts image acquisition robustness to medical image transformers: an application to lung cancer segmentation. In Medical Imaging with Deep Learning, 2024. URL https://openreview.net/forum?id=G9Te2IevNm

  20. [20]

    Self-supervised visual representation learning for medical image analysis: A comprehensive survey. Transactions on Machine Learning Research, 2024

    Siladittya Manna, Saumik Bhattacharya, and Umapada Pal. Self-supervised visual representation learning for medical image analysis: A comprehensive survey. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=3Wg1oErMcJ. Survey Certification.

  21. [21]

    Covid-19 prognosis via self-supervised representation learning and multi-image prediction. arXiv preprint arXiv:2101.04909, 2021

    Anuroop Sriram, Matthew Muckley, Koustuv Sinha, Farah Shamout, Joelle Pineau, Krzysztof J Geras, Lea Azour, Yindalon Aphinyanaphongs, Nafissa Yakubova, and William Moore. Covid-19 prognosis via self-supervised representation learning and multi-image prediction. arXiv preprint arXiv:2101.04909, 2021

  22. [22]

    Contrastive learning of global and local features for medical image segmentation with limited annotations. Advances in neural information processing systems, 33:12546–12558, 2020

    Krishna Chaitanya, Ertunc Erdil, Neerav Karani, and Ender Konukoglu. Contrastive learning of global and local features for medical image segmentation with limited annotations. Advances in neural information processing systems, 33:12546–12558, 2020

  23. [23]

    Embedding task knowledge into 3d neural networks via self-supervised learning. arXiv preprint arXiv:2006.05798, 2020

    Jiuwen Zhu, Yuexiang Li, Yifan Hu, and S Kevin Zhou. Embedding task knowledge into 3d neural networks via self-supervised learning. arXiv preprint arXiv:2006.05798, 2020

  24. [24]

    Pgl: prior-guided local self-supervised learning for 3d medical image segmentation. arXiv preprint arXiv:2011.12640, 2020

    Yutong Xie, Jianpeng Zhang, Zehui Liao, Yong Xia, and Chunhua Shen. Pgl: prior-guided local self-supervised learning for 3d medical image segmentation. arXiv preprint arXiv:2011.12640, 2020

  25. [25]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In IEEE/CVF Int Conf. Computer Vision, pages 9650–9660, 2021

  26. [26]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

  27. [27]

    Overcoming dimensional collapse in self-supervised contrastive learning for medical image segmentation

    Jamshid Hassanpour, Vinkle Kumar Srivastav, Didier Mutter, and Nicolas Padoy. Overcoming dimensional collapse in self-supervised contrastive learning for medical image segmentation. 2024 IEEE International Symposium on Biomedical Imaging (ISBI), pages 1–5, 2024. URL https://api.semanticscholar.org/CorpusID:267783037

  28. [28]

    Rubik’s cube+: A self-supervised feature learning framework for 3D medical image analysis. Medical Image Analysis, 64:101746, 2020

    Jiuwen Zhu, Yuexiang Li, Yifan Hu, Kai Ma, S Kevin Zhou, and Yefeng Zheng. Rubik’s cube+: A self-supervised feature learning framework for 3D medical image analysis. Medical Image Analysis, 64:101746, 2020

  29. [29]

    Eunji Jun, Seungwoo Jeong, Da-Woon Heo, and Heung-Il Suk. Medical Transformer: Universal brain encoder for 3D MRI analysis. arXiv preprint arXiv:2104.13633, 2021

  30. [30]

    Self-supervised learning for medical image analysis using image context restoration

    Liang Chen, Paul Bentley, Kensaku Mori, Kazunari Misawa, Michitaka Fujiwara, and Daniel Rueckert. Self-supervised learning for medical image analysis using image context restoration. Medical Image analysis, 58:101539, 2019

  31. [31]

    Parts2whole: Self-supervised contrastive learning via reconstruction

    Ruibin Feng, Zongwei Zhou, Michael B Gotway, and Jianming Liang. Parts2whole: Self-supervised contrastive learning via reconstruction. In Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning, pages 85–95. Springer, 2020.

  32. [32]

    A unified visual information preservation framework for self-supervised pre-training in medical image analysis

    Hong-Yu Zhou, Chixiang Lu, Chaoqi Chen, Sibei Yang, and Yizhou Yu. A unified visual information preservation framework for self-supervised pre-training in medical image analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

  33. [33]

    Unsupervised representation learning by predicting image rotations

    Nikos Komodakis and Spyros Gidaris. Unsupervised representation learning by predicting image rotations. In Intl Conf Learning Representations, 2018

  34. [34]

    Learning semantics-enriched representation via self-discovery, self-classification, and self-restoration

    Fatemeh Haghighi, Mohammad Reza Hosseinzadeh Taher, Zongwei Zhou, Michael B Gotway, and Jianming Liang. Learning semantics-enriched representation via self-discovery, self-classification, and self-restoration. In Medical Image Computing and Computer Assisted Intervention, pages 137–147. Springer, 2020

  35. [35]

    Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, et al. MST: Masked self-supervised transformer for visual representation. Adv. in Neu. Inf. Proc. Sys., 34:13165–13176, 2021

  36. [36]

    Simmim: A simple framework for masked image modeling

    Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. In Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pages 9653–9663, 2022

  37. [37]

    Image BERT pre-training with online tokenizer

    Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. Image BERT pre-training with online tokenizer. In Intl Conf. Learning Representations, 2022

  38. [38]

    Masked image modeling advances 3D medical image analysis

    Zekai Chen, Devansh Agarwal, Kshitij Aggarwal, Wiem Safta, Mariann Micsinai Balan, and Kevin Brown. Masked image modeling advances 3D medical image analysis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1970–1980, 2023

  39. [39]

    BEit: BERT pre-training of image transformers

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEit: BERT pre-training of image transformers. In International Conference on Learning Representations, 2022

  40. [40]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pages 16000–16009, 2022

  41. [41]

    Stare at what you see: Masked image modeling without reconstruction

    Hongwei Xue, Peng Gao, Hongyang Li, Yu Qiao, Hao Sun, Houqiang Li, and Jiebo Luo. Stare at what you see: Masked image modeling without reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22732–22741, 2023

  42. [42]

    Self-supervised pre-training of swin transformers for 3d medical image analysis

    Yucheng Tang, Dong Yang, Wenqi Li, Holger R Roth, Bennett Landman, Daguang Xu, Vishwesh Nath, and Ali Hatamizadeh. Self-supervised pre-training of swin transformers for 3d medical image analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20730–20740, 2022.

  43. [43]

    J. Huix, A. Ganeshan, J. Haslum, M. Soderberg, C. Matsoukas, and K. Smith. Are natural domain foundation models useful for medical image classification? In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 7619–7628, Los Alamitos, CA, USA, Jan. 2024. IEEE Computer Society

  44. [44]

    On the challenges and perspectives of foundation models for medical image analysis

    Shaoting Zhang and Dimitris Metaxas. On the challenges and perspectives of foundation models for medical image analysis. Medical Image Analysis, 91:102996, 2024

  45. [45]

    Rethinking supervised pre-training for better downstream transferring

    Yutong Feng, Jianwen Jiang, Mingqian Tang, Rong Jin, and Yue Gao. Rethinking supervised pre-training for better downstream transferring. In International Conference on Learning Representations, 2022

  46. [46]

    Rethinking pre-training on medical imaging

    Yang Wen, Leiting Chen, Yu Deng, and Chuan Zhou. Rethinking pre-training on medical imaging. Journal of Visual Communication and Image Representation, 78:103145, 2021

  47. [47]

    Transferable visual words: Exploiting the semantics of anatomical patterns for self-supervised learning

    Fatemeh Haghighi, Mohammad Reza Hosseinzadeh Taher, Zongwei Zhou, Michael B Gotway, and Jianming Liang. Transferable visual words: Exploiting the semantics of anatomical patterns for self-supervised learning. IEEE Transactions on Medical Imaging, 40(10):2857–2868, 2021

  48. [48]

    Unimiss: Universal medical self-supervised learning via breaking dimensionality barrier

    Yutong Xie, Jianpeng Zhang, Yong Xia, and Qi Wu. Unimiss: Universal medical self-supervised learning via breaking dimensionality barrier. In European Conference on Computer Vision, pages 558–575. Springer, 2022

  49. [49]

    How well do supervised 3d models transfer to medical imaging tasks?

    Wenxuan Li, Alan Yuille, and Zongwei Zhou. How well do supervised 3d models transfer to medical imaging tasks? arXiv preprint arXiv:2501.11253, 2025

  50. [50]

    UniverSeg: Universal medical image segmentation

    Victor Ion Butoi, Jose Javier Gonzalez Ortiz, Tianyu Ma, Mert R. Sabuncu, John Guttag, and Adrian V. Dalca. UniverSeg: Universal medical image segmentation. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 21381–21394, 2023

  51. [51]

    Segment anything in medical images

    Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images. Nature Communications, 15(654), 2024

  52. [52]

    Revisiting mae pre-training for 3d medical image segmentation

    Tassilo Wald, Constantin Ulrich, Stanislav Lukyanenko, Andrei Goncharov, Alberto Paderno, Maximilian Miller, Leander Maerkisch, Paul Jaeger, and Klaus Maier-Hein. Revisiting mae pre-training for 3d medical image segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5186–5196, 2025

  53. [53]

    UNETR: Transformers for 3D medical image segmentation

    Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R. Roth, and Daguang Xu. UNETR: Transformers for 3D medical image segmentation. In IEEE/CVF Winter Conf. Applications of Computer Vision, pages 1748–1758, 2022

  54. [54]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

  55. [55]

    AMOS: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation

    Yuanfeng Ji, Haotian Bai, Jie Yang, Chongjian Ge, Ye Zhu, Ruimao Zhang, Zhen Li, Lingyan Zhang, Wanling Ma, Xiang Wan, et al. AMOS: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation. arXiv preprint arXiv:2206.08023, 2022

  56. [56]

    Data from NSCLC-Radiomics

    H. Aerts, Rios V. E., Ralph TH Leijenaar, C. Parmar, P. Grossmann, S. Carvalho, and P. Lambin. Data from NSCLC-Radiomics. The Cancer Imaging Archive, 2015

  57. [57]

    The liver tumor segmentation benchmark (LiTS)

    Patrick Bilic, Patrick Christ, Hongwei Bran Li, Eugene Vorontsov, Avi Ben-Cohen, Georgios Kaissis, Adi Szeskin, Colin Jacobs, Gabriel Efrain Humpire Mamani, Gabriel Chartrand, et al. The liver tumor segmentation benchmark (LiTS). Medical Image Analysis, 84:102680, 2023

  58. [58]

    The KiTS21 Challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase CT

    Nicholas Heller, Fabian Isensee, Dasha Trofimova, Resha Tejpaul, Zhongchen Zhao, Huai Chen, Lisheng Wang, Alex Golts, Daniel Khapun, Daniel Shats, et al. The KiTS21 challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase CT. arXiv preprint arXiv:2307.01984, 2023

  59. [59]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE Int. Conf. Computer Vision, pages 10012–10022, 2021

  60. [60]

    CoTr: Efficiently bridging CNN and transformer for 3D medical image segmentation

    Yutong Xie, Jianpeng Zhang, Chunhua Shen, and Yong Xia. CoTr: Efficiently bridging CNN and transformer for 3D medical image segmentation. In Medical Image Computing and Computer Assisted Intervention, pages 171–180, 2021

  61. [61]

    Lesion Segmentation in Surgical and Diagnostic Applications: MICCAI 2022 Challenges, CuRIOUS 2022, KiPA 2022 and MELA 2022

    Yiming Xiao, Guanyu Yang, and Shuang Song. Lesion Segmentation in Surgical and Diagnostic Applications: MICCAI 2022 Challenges, CuRIOUS 2022, KiPA 2022 and MELA 2022, Held in Conjunction with MICCAI 2022, Singapore, September 18–22, 2022, Proceedings, volume 13648. Springer Nature, 2023

  62. [62]

    Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets

    Stephanie A Harmon, Thomas H Sanford, Sheng Xu, Evrim B Turkbey, Holger Roth, Ziyue Xu, Dong Yang, Andriy Myronenko, Victoria Anderson, Amel Amalou, et al. Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets. Nature Communications, 11(1):4080, 2020

  63. [63]

    Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation

    Holger R Roth, Le Lu, Amal Farag, Hoo-Chang Shin, Jiamin Liu, Evrim B Turkbey, and Ronald M Summers. Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part I 18, ...

  64. [64]

    Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth

    Thao Nguyen, Maithra Raghu, and Simon Kornblith. Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth. arXiv preprint arXiv:2010.15327, 2020

  65. [65]

    Feature selection via dependence maximization

    Le Song, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt. Feature selection via dependence maximization. Journal of Machine Learning Research, 13(5), 2012

  66. [66]

    Sam-med3d: towards general-purpose segmentation models for volumetric medical images

    Haoyu Wang, Sizheng Guo, Jin Ye, Zhongying Deng, Junlong Cheng, Tianbin Li, Jianpin Chen, Yanzhou Su, Ziyan Huang, Yiqing Shen, et al. Sam-med3d: towards general-purpose segmentation models for volumetric medical images. In European Conference on Computer Vision, pages 51–67. Springer, 2024