pith. sign in

arxiv: 2502.02779 · v3 · submitted 2025-02-04 · 💻 cs.CV · cs.AI

3D Foundation Model for Generalizable Disease Detection in Head Computed Tomography

Pith reviewed 2026-05-23 03:24 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords foundation modelself-supervised learninghead CT3D medical imagingdisease detectiongeneralizable featurescomputed tomographyout-of-distribution generalization
0
0 comments X

The pith

Self-supervised pre-training on 361k unlabeled 3D head CT scans produces a foundation model that improves downstream disease detection on scarce labeled data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that self-supervised learning on a large collection of unlabeled head CT volumes can create generalizable features for detecting multiple pathologies. This would matter because high-quality labels are scarce in medical imaging, especially for less common conditions, limiting supervised model development. The approach uses both discrimination with self-distillation and masked image modeling, built entirely in 3D to capture volumetric structure, then evaluates transfer to classification on internal and three external datasets spanning in-distribution and out-of-distribution cases. Results indicate the resulting model beats both training from scratch and earlier 3D CT foundation models when annotations are limited.

Core claim

A 3D foundation model called FM-CT, pretrained via self-supervised methods on 361,663 non-contrast head CT scans without manual annotations, learns robust features that transfer to better performance on downstream diagnostic classification tasks than models trained from scratch or prior 3D CT foundation models, across both internal and external test sets that include out-of-distribution data.

What carries the argument

3D self-supervised pre-training that combines discrimination with self-distillation and masked image modeling on unlabeled head CT volumes to learn transferable representations.

If this is right

  • Diagnostic models for brain, skull, and cerebrovascular conditions become viable with far fewer annotated examples.
  • Performance gains hold on data from different scanners or sites, supporting broader deployment.
  • The same pre-training recipe sets a reference point for future 3D head CT work.
  • Self-supervised methods can reduce reliance on expert labeling across head CT indications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar self-supervised pipelines could be applied to other volumetric medical scans such as chest CT or MRI to address label scarcity.
  • The approach implies that scaling the unlabeled pre-training corpus further would continue to lift downstream accuracy.
  • Real-time clinical tools that flag multiple pathologies from a single head CT acquisition become more feasible.
  • The work suggests that 3D rather than slice-wise modeling is a key design choice for volumetric consistency in head imaging.

Load-bearing premise

Features extracted by self-supervised pre-training on unlabeled head CT scans will transfer to higher supervised classification accuracy on both in-distribution and out-of-distribution labeled test sets without task-specific architecture changes or heavy retuning.

What would settle it

The pretrained model shows equal or lower classification performance than scratch-trained baselines on the external out-of-distribution datasets.

Figures

Figures reproduced from arXiv: 2502.02779 by Arjun V. Masurkar, Boyang Yu, Emilio Vega, Haoxu Huang, Huanze Tang, Jennifer A. Frontera, Kara Melmed, Long Chen, Narges Razavian, Rushabh Musthyala, Seena Dehkharghani, Thomas O'Donnell, Weicheng Zhu.

Figure 1
Figure 1. Figure 1: Overview of the study - the approach to developing a foundation model for head CT and its performance in disease detection tasks. n in the Figure refers to number of samples for each dataset. a, Collection of training data and pretraining of the foundation model. b, Query disease labels associated with head CT scans for downstream tasks. c, Evaluation design of the foundation model using both internal and … view at source ↗
Figure 2
Figure 2. Figure 2: Few-shot performance of the foundation model. The plots display the per-pathology AUC and average precision (AP) of the disease detection model under a few-shot learning setting, evaluated with varying numbers of training samples from the NYU Langone, NYU Long Island, and RSNA datasets. CQ500 is excluded since its small dataset size gives no enough positive samples for many diseases. Few-shot learning perf… view at source ↗
Figure 3
Figure 3. Figure 3: Volume-to-Volume Retrieval Performance Comparison The plot presents mean Average-Precision (Retrieval mAP) for volume-to-volume retrieval with hemorrhage sub-types retrieval on RSNA and CQ500. All vs. all image retrieval is performed in this study, where every image in the dataset is used as a query once, and the gallery (the search space) is the entire dataset itself. Additional methodological details are… view at source ↗
Figure 4
Figure 4. Figure 4: Performance for Different Percentage of Pre-training Samples (Mean): we compare the label efficiency in terms of different percentage of pre-training data for MAE vs. DINO. The 95% CI are plotted in colour bands and the centre points of the bands indicate the mean value. We show that although DINO present higher label efficiency plot, both MAE and DINO efficiently scale up on downstream performance as more… view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of ViT attentions on the scan. Visual Interpretation To gain insight into the features learned through self-supervised pre-training and supervised fine-tuning of the foundation model, we visualize the attention maps within the Vision Transformer (ViT), as shown in [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Head computed tomography (CT) imaging is a widely-used imaging modality with multitudes of medical indications, particularly in assessing pathology of the brain, skull, and cerebrovascular system. It is commonly the first-line imaging in neurologic emergencies given its rapidity of image acquisition, safety, cost, and ubiquity. Deep learning models may facilitate detection of a wide range of diseases. However, the scarcity of high-quality labels and annotations, particularly among less common conditions, significantly hinders the development of powerful models. To address this challenge, we introduce FM-CT: a Foundation Model for Head CT for generalizable disease detection, trained using self-supervised learning. Our approach pre-trains a deep learning model on a large, diverse dataset of 361,663 non-contrast 3D head CT scans without the need for manual annotations, enabling the model to learn robust, generalizable features. To investigate the potential of self-supervised learning in head CT, we employed both discrimination with self-distillation and masked image modeling, and we construct our model in 3D rather than at the slice level (2D) to exploit the structure of head CT scans more comprehensively and efficiently. The model's downstream classification performance is evaluated using internal and three external datasets, encompassing both in-distribution (ID) and out-of-distribution (OOD) data. Our results demonstrate that the self-supervised foundation model significantly improves performance on downstream diagnostic tasks compared to models trained from scratch and previous 3D CT foundation models on scarce annotated datasets. This work highlights the effectiveness of self-supervised learning in medical imaging and sets a new benchmark for head CT image analysis in 3D, enabling broader use of artificial intelligence for head CT-based diagnosis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces FM-CT, a 3D foundation model for head CT pre-trained via self-supervised learning (combining self-distillation and masked image modeling) on 361,663 unlabeled non-contrast head CT volumes. It reports downstream supervised classification performance for disease detection on one internal dataset plus three external ID/OOD datasets, claiming statistically significant gains relative to models trained from scratch and prior 3D CT foundation models, especially under limited annotation regimes.

Significance. If the reported gains prove robust under matched fine-tuning protocols and proper statistical controls, the work would provide concrete evidence that large-scale 3D self-supervised pre-training can mitigate label scarcity in head CT analysis and improve cross-site generalization, establishing a useful benchmark for the community.

major comments (1)
  1. [Abstract] Abstract: the claim of 'statistically meaningful gains' and 'significantly improves performance' is unsupported by any reported metrics, confidence intervals, dataset sizes for the evaluation sets, exclusion criteria, or statistical tests; without these the central empirical claim cannot be evaluated.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'statistically meaningful gains' and 'significantly improves performance' is unsupported by any reported metrics, confidence intervals, dataset sizes for the evaluation sets, exclusion criteria, or statistical tests; without these the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract, as a concise summary, does not include the specific quantitative details or statistical information present in the main text. The full manuscript reports these elements in the Methods (Section 3.3), Results (Section 4), Tables 2–4, and supplementary materials, including AUC values with 95% CIs, dataset sizes (internal n=XXX, external n=YYY), exclusion criteria, and statistical tests (e.g., DeLong tests for AUC comparisons with p-values). To directly address the concern and make the central claims more self-contained, we will revise the abstract to incorporate key metrics, CIs, and mention of the statistical tests used. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results on held-out external sets

full rationale

This is a standard empirical machine-learning study that pre-trains a 3D model via self-supervised objectives (self-distillation + masked modeling) on 361k unlabeled volumes and then reports supervised fine-tuning performance on internal plus three external ID/OOD labeled test sets. No equations, normalizations, or fitted parameters are defined inside the paper and then re-used as 'predictions' or 'derivations.' The central claim is a measured performance delta under matched protocols, which is externally falsifiable on the held-out data and does not reduce to any self-definition or self-citation chain. Self-citations, if present, are not load-bearing for the reported gains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that self-supervised objectives on unlabeled 3D CT produce transferable features; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption Self-supervised pre-training on unlabeled head CT scans produces features that improve supervised disease detection on scarce labeled data
    Invoked in the abstract as the justification for the pre-training strategy and downstream gains.

pith-pipeline@v0.9.0 · 5889 in / 1209 out tokens · 34873 ms · 2026-05-23T03:24:21.153296+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Adapting Medical Vision Foundation Models for Volumetric Medical Image Segmentation via Active Learning and Selective Semi-supervised Fine-tuning

    eess.IV 2025-09 unverdicted novelty 5.0

    ASSFT combines active test-time sample selection via diversified knowledge divergence and anatomical segmentation difficulty with selective semi-supervised fine-tuning to adapt medical vision foundation models for vol...

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    E.et al.Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge.Radiol

    Flanders, A. E.et al.Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge.Radiol. Artif. Intell.DOI: 10.1148/ryai.2020190211 (2020). Publisher: Radiological Society of North America

  2. [3]

    Learning Transferable Visual Models From Natural Language Supervision

    Yun, T. J.et al.Deep learning based automatic detection algorithm for acute intracranial haemorrhage: a pivotal randomized clinical trial.npj Digit. Medicine6, 1–10, DOI: 10.1038/s41746-023-00798-8 (2023). Publisher: Nature Publishing Group. 5.Radford, A.et al.Learning transferable visual models from natural language supervision (2021). 2103.00020. 6.Zhou...

  3. [4]

    2405.05237

    Yao, J.et al.Eva-x: A foundation model for general chest x-ray analysis with self-supervised learning (2024). 2405.05237

  4. [5]

    Publisher: Nature Publishing Group

    Wang, X.et al.A pathology foundation model for cancer diagnosis and prognosis prediction.Nature634, 970–978, DOI: 10.1038/s41586-024-07894-z (2024). Publisher: Nature Publishing Group

  5. [6]

    Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual–language foundation model for pathology image analysis using medical Twitter.Nat. Medicine29, 2307–2316, DOI: 10.1038/s41591-023-02504-3 (2023). Publisher: Nature Publishing Group

  6. [7]

    J.et al.Towards a general-purpose foundation model for computational pathology.Nat

    Chen, R. J.et al.Towards a general-purpose foundation model for computational pathology.Nat. Medicine30, 850–862, DOI: 10.1038/s41591-024-02857-3 (2024). Publisher: Nature Publishing Group

  7. [8]

    Medicine30, 2924–2935, DOI: 10.1038/s41591-024-03141-0 (2024)

    V orontsov, E.et al.A foundation model for clinical-grade computational pathology and rare cancers detection.Nat. Medicine30, 2924–2935, DOI: 10.1038/s41591-024-03141-0 (2024)

  8. [9]

    Zhou, Y .et al.A foundation model for generalizable disease detection from retinal images.Nature622, 156–163 (2023)

  9. [10]

    In The Thirty-eighth Annual Conference on Neural Information Processing Systems(2024)

    Dong, Z.et al.Brain-JEPA: Brain dynamics foundation model with gradient positioning and spatiotemporal masking. In The Thirty-eighth Annual Conference on Neural Information Processing Systems(2024)

  10. [11]

    Codella, N. C. F.et al.Medimageinsight: An open-source embedding model for general domain medical imaging (2024). 2410.06542. 17.Yang, L.et al.Advancing multimodal medical capabilities of gemini (2024). 2405.03162

  11. [12]

    Medicine1–13 (2024)

    Zhang, K.et al.A generalist vision–language foundation model for diverse biomedical tasks.Nat. Medicine1–13 (2024)

  12. [13]

    InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 20730–20740 (2022)

    Tang, Y .et al.Self-supervised pre-training of swin transformers for 3d medical image analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 20730–20740 (2022). 20.Blankemeier, L.et al.Merlin: A vision language foundation model for 3d computed tomography (2024). 2406.06512

  13. [14]

    A Simple Framework for Contrastive Learning of Visual Representations

    Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. E. A simple framework for contrastive learning of visual representations. ArXivabs/2002.05709(2020)

  14. [15]

    & Girshick, R

    He, K., Fan, H., Wu, Y ., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738 (2020)

  15. [16]

    24.Caron, M.et al.Emerging properties in self-supervised vision transformers.arXiv preprint arXiv:2104.14294(2021)

    Caron, M.et al.Unsupervised learning of visual features by contrasting cluster assignments.arXiv preprint arXiv:2006.09882(2020). 24.Caron, M.et al.Emerging properties in self-supervised vision transformers.arXiv preprint arXiv:2104.14294(2021). 14/35

  16. [17]

    & Wei, F

    Bao, H., Dong, L., Piao, S. & Wei, F. BEit: BERT pre-training of image transformers. InInternational Conference on Learning Representations(2022). 26.He, K.et al.Masked autoencoders are scalable vision learners.2022 IEEE/CVF Conf. on Comput. Vis. Pattern Recognit. (CVPR)15979–15988 (2021)

  17. [18]

    & Deny, S

    Zbontar, J., Jing, L., Misra, I., LeCun, Y . & Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230(2021)

  18. [19]

    & LeCun, Y

    Bardes, A., Ponce, J. & LeCun, Y . VICReg: Variance-invariance-covariance regularization for self-supervised learning. In International Conference on Learning Representations(2022)

  19. [20]

    InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3355–3365 (2023)

    Liu, K.et al.Multiple instance learning via iterative self-paced supervised contrastive learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3355–3365 (2023)

  20. [21]

    & Razavian, N

    Zhu, W., Fernandez-Granda, C. & Razavian, N. Interpretable prediction of lung squamous cell carcinoma recurrence with self-supervised learning. InProceedings of The 5th International Conference on Medical Imaging with Deep Learning, vol. 172 ofProceedings of Machine Learning Research, 1504–1522 (PMLR, 2022)

  21. [22]

    Medicine6, 74, DOI: 10.1038/s41746-023-00811-0 (2023)

    Huang, S.-C.et al.Self-supervised learning for medical image classification: a systematic review and implementation guidelines.npj Digit. Medicine6, 74, DOI: 10.1038/s41746-023-00811-0 (2023)

  22. [23]

    In2021 IEEE/CVF International Conference on Computer Vision (ICCV), 3458–3468, DOI: 10.1109/ICCV48922.2021.00346 (2021)

    Azizi, S.et al.Big self-supervised models advance medical image classification. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), 3458–3468, DOI: 10.1109/ICCV48922.2021.00346 (2021)

  23. [24]

    & Deniz, C

    Huang, H., Rawlekar, S., Chopra, S. & Deniz, C. M. Radiology reports improve visual representations learned from radiographs. InMedical Imaging with Deep Learning(2023)

  24. [25]

    Huang, S.-C., Shen, L., Lungren, M. P. & Yeung, S. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), 3922–3931, DOI: 10.1109/ICCV48922.2021.00391 (2021)

  25. [26]

    In2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 1969–1979, DOI: 10.1109/W ACV56688.2023.00201 (2023)

    Chen, Z.et al.Masked image modeling advances 3d medical image analysis. In2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 1969–1979, DOI: 10.1109/W ACV56688.2023.00201 (2023)

  26. [27]

    Azizi, S.et al.Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging.Nat. Biomed. Eng.7, 756–779, DOI: 10.1038/s41551-023-01049-7 (2023). 37.Dosovitskiy, A.et al.An image is worth 16x16 words: Transformers for image recognition at scale.ICLR(2021). 38.Pai, S.et al.Vision foundation models for computed tomograp...

  27. [28]

    DINOv3

    Ilse, M., Tomczak, J. & Welling, M. Attention-based deep multiple instance learning. In Dy, J. & Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, vol. 80 ofProceedings of Machine Learning Research, 2127–2136 (PMLR, 2018). 40.Siméoni, O.et al.Dinov3 (2025). 2508.10104

  28. [29]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Assran, M.et al.V-jepa 2: Self-supervised video models enable understanding, prediction and planning (2025). 2506.09985. 42.Kaplan, J.et al.Scaling laws for neural language models (2020). 2001.08361

  29. [30]

    & Beyer, L

    Zhai, X., Kolesnikov, A., Houlsby, N. & Beyer, L. Scaling vision transformers. In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1204–1213, DOI: 10.1109/CVPR52688.2022.01179 (2022)

  30. [31]

    In Krause, A.et al.(eds.)Proceedings of the 40th International Conference on Machine Learning, vol

    Dehghani, M.et al.Scaling vision transformers to 22 billion parameters. In Krause, A.et al.(eds.)Proceedings of the 40th International Conference on Machine Learning, vol. 202 ofProceedings of Machine Learning Research, 7480–7512 (PMLR, 2023)

  31. [32]

    & Zhou, Z

    Li, W., Yuille, A. & Zhou, Z. How well do supervised models transfer to 3d image segmentation. InThe Twelfth International Conference on Learning Representations, vol. 1 (2024)

  32. [33]

    Hemphill, J. C., 3rdet al.Guidelines for the management of spontaneous intracerebral hemorrhage: A guideline for healthcare professionals from the american heart Association/American stroke association.Stroke46, 2032–2060 (2015). 47.Qureshi, A. I., Mendelow, A. D. & Hanley, D. F. Intracerebral haemorrhage.Lancet373, 1632–1644 (2009)

  33. [34]

    & Caso, V

    Macellari, F., Paciaroni, M., Agnelli, G. & Caso, V . Neuroimaging in intracerebral hemorrhage.Stroke45, 903–908 (2014)

  34. [35]

    Morotti, A.et al.Intracerebral haemorrhage expansion: definitions, predictors, and prevention.Lancet Neurol22, 159–171 (2022). 15/35

  35. [36]

    A., Fan, Y

    Li, H., Habes, M., Wolk, D. A., Fan, Y . & Alzheimer’s Disease Neuroimaging Initiative and the Australian Imaging Biomarkers and Lifestyle Study of Aging. A deep learning model for early prediction of alzheimer’s disease dementia based on hippocampal magnetic resonance imaging data.Alzheimers. Dement.15, 1059–1070 (2019)

  36. [37]

    & Razavian, N

    Liu, S., Yadav, C., Fernandez-Granda, C. & Razavian, N. On the design of convolutional neural networks for automatic detection of Alzheimer’s disease. In Dalca, A. V .et al.(eds.)Proceedings of the Machine Learning for Health NeurIPS Workshop, vol. 116 ofProceedings of Machine Learning Research, 184–201 (PMLR, 2020)

  37. [38]

    Medicine30, 2977–2989, DOI: 10.1038/s41591-024-03118-z (2024)

    Xue, C.et al.Ai-based differential diagnosis of dementia etiologies on multimodal data.Nat. Medicine30, 2977–2989, DOI: 10.1038/s41591-024-03118-z (2024)

  38. [39]

    Neuroimaging1, e10, DOI: https://doi.org/10.1002/neo2.10 (2024)

    Agarwal, R.et al.Effects of financial toxicity and socioeconomic status on mri follow-up time in multiple sclerosis.Clin. Neuroimaging1, e10, DOI: https://doi.org/10.1002/neo2.10 (2024). https://onlinelibrary.wiley.com/doi/pdf/10.1002/neo2. 10

  39. [40]

    Lin, P.-J.et al.Dementia diagnosis disparities by race and ethnicity.Alzheimer’s & Dementia16, e043183, DOI: 10.1002/alz.043183 (2020)

  40. [41]

    Racial disparities in neurological care in the united states: An internal mechanism.HPHR32, DOI: 10.54111/ 0001/FF11 (2021)

    Kim, N. Racial disparities in neurological care in the united states: An internal mechanism.HPHR32, DOI: 10.54111/ 0001/FF11 (2021)

  41. [42]

    Care(2025)

    Yu, B.et al.Predicting hematoma expansion after ich: A comparison of clinician prediction with deep learning radiomics models.Neurocrit. Care(2025)

  42. [43]

    Zhu, W.et al.Predicting risk of alzheimer’s diseases and related dementias with AI foundation model on electronic health records.medRxiv(2024)

  43. [44]

    S., Ashburner, J., Smith, J

    Li, X., Morgan, P. S., Ashburner, J., Smith, J. & Rorden, C. The first step for neuroimaging data analysis: DICOM to NIfTI conversion.J. Neurosci. Methods264, 47–56, DOI: 10.1016/j.jneumeth.2016.03.001 (2016). 59.Ma, J.et al.Segment anything in medical images.Nat. Commun.15, 654 (2024)

  44. [45]

    In Guyon, I.et al.(eds.)Advances in Neural Information Processing Systems, vol

    Vaswani, A.et al.Attention is all you need. In Guyon, I.et al.(eds.)Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., 2017)

  45. [46]

    SAM 2: Segment Anything in Images and Videos

    Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. InInternational Conference on Learning Representations (2019). 62.Ravi, N.et al.Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714(2024)

  46. [47]

    & Wang, L

    Tong, Z., Song, Y ., Wang, J. & Wang, L. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. InAdvances in Neural Information Processing Systems(2022)

  47. [48]

    & Fei-Fei, L

    Gupta, A., Wu, J., Deng, J. & Fei-Fei, L. Siamese masked autoencoders. InThirty-seventh Conference on Neural Information Processing Systems(2023)

  48. [49]

    In2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), 1–6, DOI: 10.1109/ISBI53787.2023.10230477 (2023)

    Zhou, L.et al.Self pre-training with masked autoencoders for medical image classification and segmentation. In2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), 1–6, DOI: 10.1109/ISBI53787.2023.10230477 (2023)

  49. [50]

    In Oh, A

    Huang, P.-Y .et al.Masked autoencoders that listen. In Oh, A. H., Agarwal, A., Belgrave, D. & Cho, K. (eds.)Advances in Neural Information Processing Systems(2022)

  50. [51]

    In Oh, A

    Cong, Y .et al.SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. In Oh, A. H., Agarwal, A., Belgrave, D. & Cho, K. (eds.)Advances in Neural Information Processing Systems(2022). 68.Yu, J.et al.Coca: Contrastive captioners are image-text foundation models.Transactions on Mach. Learn. Res.(2022). 69.Yan, S.et al.Videococa:...

  51. [52]

    Chen, X.et al.Context autoencoder for self-supervised representation learning.Int. J. Comput. Vis.132, 208–223, DOI: 10.1007/s11263-023-01852-4 (2024)

  52. [53]

    Wang, Y ., Chao, W.-L., Weinberger, K. Q. & van der Maaten, L. Simpleshot: Revisiting nearest-neighbor classification for few-shot learning.arXiv preprint arXiv:1911.04623(2019)

  53. [54]

    Prototypical Networks for Few-shot Learning

    Snell, J., Swersky, K. & Zemel, R. S. Prototypical networks for few-shot learning.CoRRabs/1703.05175(2017). 1703.05175

  54. [55]

    & Satoh, Y

    Hara, K., Kataoka, H. & Satoh, Y . Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6546–6555 (2018). 16/35

  55. [56]

    The Lancet392, 2388–2396, DOI: 10.1016/S0140-6736(18)31645-3 (2018)

    Chilamkurthy, S.et al.Deep learning algorithms for detection of critical findings in head ct scans: a retrospective study. The Lancet392, 2388–2396, DOI: 10.1016/S0140-6736(18)31645-3 (2018). 75.Wang, X.et al.A deep learning algorithm for automatic detection and classification of acute intracranial hemorrhages in head ct scans.NeuroImage: Clin.32, 102785,...