3D Foundation Model for Generalizable Disease Detection in Head Computed Tomography
Pith reviewed 2026-05-23 03:24 UTC · model grok-4.3
The pith
Self-supervised pre-training on 361k unlabeled 3D head CT scans produces a foundation model that improves downstream disease detection on scarce labeled data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A 3D foundation model called FM-CT, pretrained via self-supervised methods on 361,663 non-contrast head CT scans without manual annotations, learns robust features that transfer to better performance on downstream diagnostic classification tasks than models trained from scratch or prior 3D CT foundation models, across both internal and external test sets that include out-of-distribution data.
What carries the argument
3D self-supervised pre-training that combines discrimination with self-distillation and masked image modeling on unlabeled head CT volumes to learn transferable representations.
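To make the two objectives concrete, here is a minimal, standard-library-only sketch of a DINO-style self-distillation loss and a masked-patch reconstruction loss. It is not the paper's implementation. The function names, temperatures, momentum value, and the weighted sum in `combined_ssl_loss` are all illustrative assumptions, and this page does not specify whether the two objectives are summed into one loss or trained as separate variants.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def self_distillation_loss(student_logits, teacher_logits, center,
                           student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a centered, sharpened teacher and the student.

    The temperatures are assumed defaults, not values taken from the paper.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centered, teacher_temp)
    student_probs = softmax(student_logits, student_temp)
    return -sum(p * math.log(q + 1e-12)
                for p, q in zip(teacher_probs, student_probs))


def masked_reconstruction_loss(target_patches, predicted_patches, mask):
    """Mean squared error averaged over masked patches only."""
    errors = []
    for target, pred, is_masked in zip(target_patches, predicted_patches, mask):
        if is_masked:
            squared = sum((t - p) ** 2 for t, p in zip(target, pred))
            errors.append(squared / len(target))
    return sum(errors) / len(errors) if errors else 0.0


def ema_update(teacher_weights, student_weights, momentum=0.996):
    """Teacher weights track the student as an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]


def combined_ssl_loss(distillation, reconstruction, mim_weight=1.0):
    """Hypothetical weighted sum of the two objectives."""
    return distillation + mim_weight * reconstruction
```

In a real pipeline the logits and patch predictions would come from a 3D encoder applied to two augmented views of the same volume; here they are plain lists so the loss arithmetic can be checked in isolation.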
If this is right
- Diagnostic models for brain, skull, and cerebrovascular conditions become viable with far fewer annotated examples.
- Performance gains hold on data from different scanners or sites, supporting broader deployment.
- The same pre-training recipe sets a reference point for future 3D head CT work.
- Self-supervised methods can reduce reliance on expert labeling across head CT indications.
Where Pith is reading between the lines
- Similar self-supervised pipelines could be applied to other volumetric medical scans such as chest CT or MRI to address label scarcity.
- The approach implies that scaling the unlabeled pre-training corpus further would continue to lift downstream accuracy.
- Real-time clinical tools that flag multiple pathologies from a single head CT acquisition become more feasible.
- The work suggests that 3D rather than slice-wise modeling is a key design choice for volumetric consistency in head imaging.
Load-bearing premise
Features extracted by self-supervised pre-training on unlabeled head CT scans will transfer to higher supervised classification accuracy on both in-distribution and out-of-distribution labeled test sets without task-specific architecture changes or heavy retuning.
What would settle it
The pretrained model shows equal or lower classification performance than scratch-trained baselines on the external out-of-distribution datasets.
Original abstract
Head computed tomography (CT) imaging is a widely-used imaging modality with multitudes of medical indications, particularly in assessing pathology of the brain, skull, and cerebrovascular system. It is commonly the first-line imaging in neurologic emergencies given its rapidity of image acquisition, safety, cost, and ubiquity. Deep learning models may facilitate detection of a wide range of diseases. However, the scarcity of high-quality labels and annotations, particularly among less common conditions, significantly hinders the development of powerful models. To address this challenge, we introduce FM-CT: a Foundation Model for Head CT for generalizable disease detection, trained using self-supervised learning. Our approach pre-trains a deep learning model on a large, diverse dataset of 361,663 non-contrast 3D head CT scans without the need for manual annotations, enabling the model to learn robust, generalizable features. To investigate the potential of self-supervised learning in head CT, we employed both discrimination with self-distillation and masked image modeling, and we construct our model in 3D rather than at the slice level (2D) to exploit the structure of head CT scans more comprehensively and efficiently. The model's downstream classification performance is evaluated using internal and three external datasets, encompassing both in-distribution (ID) and out-of-distribution (OOD) data. Our results demonstrate that the self-supervised foundation model significantly improves performance on downstream diagnostic tasks compared to models trained from scratch and previous 3D CT foundation models on scarce annotated datasets. This work highlights the effectiveness of self-supervised learning in medical imaging and sets a new benchmark for head CT image analysis in 3D, enabling broader use of artificial intelligence for head CT-based diagnosis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FM-CT, a 3D foundation model for head CT pre-trained via self-supervised learning (combining self-distillation and masked image modeling) on 361,663 unlabeled non-contrast head CT volumes. It reports downstream supervised classification performance for disease detection on one internal dataset plus three external ID/OOD datasets, claiming statistically significant gains relative to models trained from scratch and prior 3D CT foundation models, especially under limited annotation regimes.
Significance. If the reported gains prove robust under matched fine-tuning protocols and proper statistical controls, the work would provide concrete evidence that large-scale 3D self-supervised pre-training can mitigate label scarcity in head CT analysis and improve cross-site generalization, establishing a useful benchmark for the community.
major comments (1)
- [Abstract] The claim that the model 'significantly improves performance' is unsupported by any reported metrics, confidence intervals, evaluation-set sizes, exclusion criteria, or statistical tests; without these, the central empirical claim cannot be evaluated.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the single major comment below.
- Referee: [Abstract] The claim that the model 'significantly improves performance' is unsupported by any reported metrics, confidence intervals, evaluation-set sizes, exclusion criteria, or statistical tests; without these, the central empirical claim cannot be evaluated.
  Authors: We agree that the abstract, as a concise summary, does not include the quantitative details or statistical information present in the main text. The full manuscript reports these elements in the Methods (Section 3.3), Results (Section 4), Tables 2–4, and supplementary materials, including AUC values with 95% CIs, dataset sizes (internal n=XXX, external n=YYY), exclusion criteria, and statistical tests (e.g., DeLong tests for AUC comparisons with p-values). To address the concern directly and make the central claims more self-contained, we will revise the abstract to include key metrics and CIs and to name the statistical tests used.
  Revision: yes
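The exchange above turns on AUC values, 95% confidence intervals, and paired comparisons between models. As a minimal illustration of how such a comparison can be reported, the sketch below computes AUC from the Mann-Whitney statistic and a paired percentile-bootstrap interval for the AUC difference between two models scored on the same cases. It deliberately uses a bootstrap rather than the DeLong test named in the rebuttal, because the bootstrap needs only the standard library. The function names, the resample count, and the seed are assumptions, and none of this reflects the authors' actual evaluation code.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def bootstrap_auc_difference(labels, scores_a, scores_b,
                             n_resamples=1000, alpha=0.05, seed=0):
    """Point estimate and percentile CI for AUC(a) - AUC(b) on paired cases."""
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_resamples:
        # Resample cases with replacement, keeping both models' scores paired.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # skip resamples that contain a single class
            a = [scores_a[i] for i in idx]
            b = [scores_b[i] for i in idx]
            deltas.append(auc(y, a) - auc(y, b))
    deltas.sort()
    lower = deltas[int((alpha / 2) * n_resamples)]
    upper = deltas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    point = auc(labels, scores_a) - auc(labels, scores_b)
    return point, (lower, upper)
```

An interval for the difference that excludes zero is the kind of evidence the referee asks for; an interval that spans zero on the external sets would be consistent with the "What would settle it" condition stated earlier.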
Circularity Check
No significant circularity; empirical results on held-out external sets
This is a standard empirical machine-learning study that pre-trains a 3D model via self-supervised objectives (self-distillation + masked modeling) on 361k unlabeled volumes and then reports supervised fine-tuning performance on internal plus three external ID/OOD labeled test sets. No equations, normalizations, or fitted parameters are defined inside the paper and then re-used as 'predictions' or 'derivations.' The central claim is a measured performance delta under matched protocols, which is externally falsifiable on the held-out data and does not reduce to any self-definition or self-citation chain. Self-citations, if present, are not load-bearing for the reported gains.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Self-supervised pre-training on unlabeled head CT scans produces features that improve supervised disease detection on scarce labeled data
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We employed both discrimination with self-distillation and masked image modeling... 3D ViT... pre-trains... on 361,663 non-contrast 3D head CT scans
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: ViT-Base architecture with an embedding dimension of 768, 12 self-attention layers... 512 patches of size 12×12×12
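The quoted passage mentions 512 patches of size 12×12×12. As a quick check on that arithmetic, the sketch below splits a volume into non-overlapping cubic patches and counts the resulting tokens. The 96×96×96 input size used in the comment is an inference from 8 × 8 × 8 = 512 under the assumption of a cubic crop; it is not stated on this page, and `patch_grid` and `patchify` are hypothetical helper names.

```python
def patch_grid(volume_shape, patch_size):
    """Number of non-overlapping cubic patches along each axis."""
    grid = []
    for dim in volume_shape:
        if dim % patch_size != 0:
            raise ValueError("each volume dimension must be divisible by patch_size")
        grid.append(dim // patch_size)
    return tuple(grid)


def patchify(volume, patch_size):
    """Split a nested-list volume [depth][height][width] into flattened patches."""
    shape = (len(volume), len(volume[0]), len(volume[0][0]))
    grid_d, grid_h, grid_w = patch_grid(shape, patch_size)
    patches = []
    for pd in range(grid_d):
        for ph in range(grid_h):
            for pw in range(grid_w):
                # Flatten one patch in (z, y, x) order.
                patch = [
                    volume[pd * patch_size + z][ph * patch_size + y][pw * patch_size + x]
                    for z in range(patch_size)
                    for y in range(patch_size)
                    for x in range(patch_size)
                ]
                patches.append(patch)
    return patches


# Assumed cubic crop: 96 / 12 = 8 patches per axis, so 8 * 8 * 8 = 512 tokens,
# each holding 12 * 12 * 12 = 1728 voxels before projection to the embedding.
grid = patch_grid((96, 96, 96), 12)
num_tokens = grid[0] * grid[1] * grid[2]
voxels_per_patch = 12 ** 3
```

Each flattened patch would then be linearly projected to the 768-dimensional embedding mentioned in the passage; that projection is omitted here.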
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Adapting Medical Vision Foundation Models for Volumetric Medical Image Segmentation via Active Learning and Selective Semi-supervised Fine-tuning
  ASSFT combines active test-time sample selection via diversified knowledge divergence and anatomical segmentation difficulty with selective semi-supervised fine-tuning to adapt medical vision foundation models for vol...