pith.

arxiv: 2605.13059 · v1 · pith:IW2PKIDA · submitted 2026-05-13 · 💻 cs.CV

BrainAnytime: Anatomy-Aware Cross-Modal Pretraining for Brain Image Analysis with Arbitrary Modality Availability

Pith reviewed 2026-05-14 20:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords: brain imaging · cross-modal pretraining · masked autoencoder · arbitrary modalities · Alzheimer classification · MRI PET fusion · medical image analysis · missing modality

The pith

A single pretrained model analyzes brain images using whatever MRI or PET scans are available at the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work develops a pretraining approach that trains one network on tens of thousands of 3D brain scans so it can later accept any subset of standard MRI sequences plus amyloid-PET without retraining. By aligning structural and molecular signals through cross-modal distillation and by prioritizing disease-vulnerable anatomy when masking during training, the model learns representations that remain useful when some modalities are absent. Experiments across four clinical tasks and five clinically motivated modality combinations show gains over single-modality models, existing missing-modality baselines, and large-scale brain MRI foundation models in most settings, with the largest reported lifts on Alzheimer's and mild cognitive impairment detection. The central idea is that a shared encoder can internalize the escalation pathway clinicians already follow, so that incomplete acquisitions become a supported input rather than a failure case.

Core claim

BrainAnytime is a single 3D masked autoencoder pretrained on 34,899 scans that uses cross-modal distillation to transfer information between MRI and PET and atlas-guided curriculum masking to emphasize regions prone to neurodegeneration, allowing the same weights to be applied to any combination of available sequences from a lone T1 scan up to a full multimodal workup.
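One set of weights can serve any subset because a Transformer encoder consumes a variable-length token sequence. The sketch below is a minimal illustration, assuming four inputs (T1, T2, FLAIR, amyloid-PET; the exact modality list is an assumption). It enumerates every non-empty subset and assembles tokens from whichever scans are present, tagging each patch with a modality embedding. The names `nonempty_subsets`, `assemble_tokens`, and the toy embeddings are hypothetical and are not taken from the BrainAnytime code.

```python
from itertools import combinations

# Assumed modality set for illustration; the paper's exact sequence list may differ.
MODALITIES = ("T1", "T2", "FLAIR", "PET")


def nonempty_subsets(modalities=MODALITIES):
    """All non-empty modality combinations a single model may receive."""
    return [
        combo
        for r in range(1, len(modalities) + 1)
        for combo in combinations(modalities, r)
    ]


def assemble_tokens(patches_by_modality, modality_embedding):
    """Build one variable-length token sequence from whatever scans are present.

    patches_by_modality: dict, modality name -> list of patch vectors
    modality_embedding:  dict, modality name -> vector added to that modality's patches
    Absent modalities contribute no tokens, so the sequence length changes with
    availability while the encoder weights stay the same.
    """
    tokens = []
    for name in MODALITIES:  # fixed order keeps the output deterministic
        for patch in patches_by_modality.get(name, []):
            emb = modality_embedding[name]
            tokens.append([p + e for p, e in zip(patch, emb)])
    return tokens


emb = {m: [float(i), 0.0] for i, m in enumerate(MODALITIES)}
print(len(nonempty_subsets()))
print(assemble_tokens({"T1": [[1.0, 1.0], [2.0, 2.0]]}, emb))
print(assemble_tokens({"T1": [[1.0, 1.0]], "PET": [[5.0, 5.0]]}, emb))
```

Under this assumed four-modality set there are 15 non-empty input combinations, which puts the five evaluated settings in perspective.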

What carries the argument

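Multi-MAE3D, a 3D multi-modal masked autoencoder whose shared Transformer encoder is trained with Reciprocal Cross-Modal Distillation (RCMD), an EMA-teacher objective that aligns MRI and PET representations, and with Pathology-Aware Curriculum Masking (PACM), an atlas-guided curriculum that prioritizes disease-vulnerable structures during pretraining.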
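The summary describes RCMD only as an EMA-teacher objective that aligns MRI and PET representations. The sketch below is a minimal, hypothetical reading of it: "reciprocal" is taken to mean that the student's MRI embedding is regressed onto the teacher's PET embedding and vice versa, with mean-squared error and an EMA momentum of 0.996. The loss form, the momentum value, and all function names are assumptions, not the paper's definitions. Embeddings are plain Python lists to keep the example self-contained.

```python
def ema_update(teacher, student, momentum=0.996):
    """Move each teacher weight toward the student's: t <- m * t + (1 - m) * s."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def mse(a, b):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def reciprocal_distill_loss(student_mri, student_pet, teacher_mri, teacher_pet):
    """Symmetric cross-modal term (assumed form, not the paper's equation).

    The student's MRI embedding is pulled toward the teacher's PET embedding and
    vice versa. Teacher outputs act as fixed targets; in an autograd framework
    they would be detached. Subjects without a paired MRI/PET scan contribute 0.
    """
    if None in (student_mri, student_pet, teacher_mri, teacher_pet):
        return 0.0
    return 0.5 * (mse(student_mri, teacher_pet) + mse(student_pet, teacher_mri))


# One toy step: compute the loss, then (after an optimizer would have updated
# the student) refresh the teacher from the student weights.
student_w, teacher_w = [0.2, -0.1], [0.0, 0.0]
loss = reciprocal_distill_loss([0.9, 0.1], [0.7, 0.3], [0.8, 0.2], [0.6, 0.4])
teacher_w = ema_update(teacher_w, student_w)
print(round(loss, 4), teacher_w)
```

A slowly moving teacher gives stable cross-modal targets; skipping unpaired subjects is one simple way such a term could coexist with scans that lack PET.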

If this is right

  • A hospital can deploy one network across scanners that routinely omit different sequences without building separate models for each protocol.
  • On CN-versus-AD and CN-versus-MCI classification, average accuracy improves by 6.2% and 7.0% (relative), respectively, over the strongest missing-modality baselines, with partial-imaging settings included in the evaluation.
  • The same pretrained weights support both routine structural diagnosis and molecular confirmation tasks without a separate model for each modality combination.
  • Curriculum masking focused on atlas-defined vulnerable anatomy transfers disease-relevant features more effectively than uniform random masking.
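The 6.2% and 7.0% figures are relative improvements in average accuracy, not percentage points. The small sketch below shows the arithmetic. The baseline accuracies in it are hypothetical, chosen only to illustrate the conversion, and are not numbers reported in the paper.

```python
def relative_improvement(baseline_acc, new_acc):
    """(new - baseline) / baseline; 0.062 means a 6.2% relative gain."""
    return (new_acc - baseline_acc) / baseline_acc


def implied_accuracy(baseline_acc, relative_gain):
    """Accuracy needed to beat a baseline by a given relative gain."""
    return baseline_acc * (1.0 + relative_gain)


# Hypothetical baselines, NOT values reported in the paper.
for baseline in (0.70, 0.85):
    target = implied_accuracy(baseline, 0.062)
    print(f"baseline {baseline:.2f} -> {target:.4f} "
          f"(+{100 * (target - baseline):.2f} points)")
```

The same relative gain corresponds to more percentage points on a stronger baseline, so the absolute lift depends on baseline accuracies that the abstract does not state.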

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same pretraining recipe could be applied to other organ systems where imaging protocols vary by site and where certain sequences are routinely skipped.
  • Longitudinal scans of the same patient could be fed through the model at different time points to track progression without requiring identical modality sets each visit.
  • If the learned correspondences generalize, the framework could serve as a starting point for adding emerging modalities such as tau-PET without full retraining.

Load-bearing premise

The structural-molecular relationships captured from the five chosen datasets will continue to hold for previously unseen modality combinations and for patient groups outside the original training populations.

What would settle it

Evaluation on an external cohort that supplies only T2-FLAIR without T1 or PET, checking whether classification accuracy on CN versus AD falls below that of a T2-FLAIR-only model trained from scratch.

Figures

Figures reproduced from arXiv: 2605.13059 by Guangqian Yang, Qian Niu, Shujun Wang, Tong Ding, Wenlong Hou, Ye Du, Yue Xun.

Figure 1: Overall framework of BrainAnytime. Multi-MAE3D, a 3D multi-modal masked autoencoder with a shared Transformer encoder for any modality subset; Reciprocal Cross-Modal Distillation (RCMD), an EMA-teacher distillation objective that aligns MRI and PET representations; and Pathology-Aware Curriculum Masking (PACM), an atlas-guided curriculum masking strategy that emphasizes AD-relevant neuroanatomy during r…
Figure 2: Modality robustness analysis across four downstream tasks under simu…
Figure 4: Label efficiency analysis of BrainAnytime with 10%, 20%, 50%, 80%, and …
Original abstract

Clinical diagnostic workups typically follow a modality escalation pathway: after initial clinical evaluation, clinicians begin with routine structural imaging (e.g., MRI), selectively add sequences such as FLAIR or T2 to refine the differential, and reserve molecular imaging (e.g., amyloid-PET) for cases that remain uncertain after standard evaluation. Consequently, patients are observed with heterogeneous and often incomplete modality subsets. However, most current AI models assume a fixed set of input modalities. In this paper, we present BrainAnytime, a unified pretraining framework, trained on 34,899 3D brain scans from five datasets, that supports brain image analysis under arbitrary modality availability spanning multi-sequence MRI and amyloid-PET. A single model accepts whatever imaging is available, from a lone T1 scan to a full multimodal workup. Pretraining learns structural-molecular correspondences between MRI and PET via cross-modal distillation (RCMD) and prioritizes disease-vulnerable anatomy via atlas-guided curriculum masking (PACM), all within a shared 3D masked autoencoder (Multi-MAE3D). Across four downstream tasks and five clinically motivated modality settings, BrainAnytime outperforms modality-specific models, missing-modality baselines, and large-scale brain MRI pretrained foundation models in most modality settings. Notably, it surpasses the strongest missing-modality baselines with relative improvements of 6.2% and 7.0% in average accuracy on CN vs. AD and CN vs. MCI classification, respectively. Code is available at https://github.com/SDH-Lab/BrainAnytime.
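PACM is described only as atlas-guided curriculum masking that emphasizes AD-relevant neuroanatomy. One plausible realization, offered as a sketch rather than the authors' method, is to blend a uniform masking distribution with per-patch atlas vulnerability scores, ramp the atlas share linearly over training, and draw masked patches without replacement using the Gumbel-top-k trick (Kool et al., which appears in the paper's reference list). The vulnerability scores, the linear ramp, and the function names are hypothetical.

```python
import math
import random


def masking_weights(vulnerability, progress):
    """Blend uniform and atlas-guided masking probabilities.

    vulnerability: non-negative per-patch scores from an atlas (hypothetical,
                   e.g. higher for hippocampal or entorhinal patches)
    progress:      training progress in [0, 1]; the atlas share ramps linearly
    """
    n = len(vulnerability)
    total = sum(vulnerability)
    alpha = min(max(progress, 0.0), 1.0)
    return [(1.0 - alpha) / n + alpha * v / total for v in vulnerability]


def gumbel_top_k(weights, k, rng=random):
    """Sample k distinct indices with probability proportional to weights.

    Adding independent Gumbel(0, 1) noise to log-weights and keeping the k
    largest keys is equivalent to sampling without replacement (Gumbel-top-k).
    """
    keys = []
    for i, w in enumerate(weights):
        if w <= 0.0:
            continue  # zero-weight patches are never masked
        u = rng.random()
        gumbel = -math.log(-math.log(u)) if u > 0.0 else -math.inf
        keys.append((math.log(w) + gumbel, i))
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:k])


rng = random.Random(0)
scores = [0.0, 0.0, 0.0, 5.0, 5.0, 1.0, 0.0, 0.0]  # toy atlas scores
early = masking_weights(scores, progress=0.0)  # uniform masking
late = masking_weights(scores, progress=1.0)   # atlas-driven masking
print(gumbel_top_k(early, 3, rng), gumbel_top_k(late, 3, rng))
```

Early in training every patch is equally likely to be masked; by the end, masking concentrates on the high-scoring patches, which is the curriculum behaviour the PACM description implies.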

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents BrainAnytime, a unified pretraining framework for brain image analysis under arbitrary modality availability. It pretrains a shared 3D masked autoencoder (Multi-MAE3D) on 34,899 scans from five datasets using RCMD cross-modal distillation to learn MRI-PET correspondences and PACM atlas-guided curriculum masking to prioritize disease-vulnerable anatomy. A single model is claimed to accept inputs ranging from a lone T1 scan to full multimodal sets. Across four downstream tasks and five clinically motivated modality settings, it outperforms modality-specific models, missing-modality baselines, and large-scale MRI foundation models in most settings, with relative average-accuracy gains over the strongest missing-modality baselines of 6.2% on CN vs. AD and 7.0% on CN vs. MCI.

Significance. If the generalization claims hold, the work would be significant for clinical applications where imaging modalities are heterogeneous and incomplete, reducing the need for modality-specific models. The scale of pretraining data and public code release strengthen the contribution. However, the practical impact is limited by the narrow range of tested modality combinations.

major comments (2)
  1. Experiments section: The central claim of support for arbitrary modality availability (from lone T1 to full multimodal) is not directly supported by the evidence, as all quantitative results are restricted to five clinically motivated modality settings. No evaluations are reported for unseen combinations such as isolated FLAIR, T2+PET without T1, or PET-only, leaving generalization to arbitrary unseen subsets untested.
  2. Results section: The reported outperformance and relative improvements (6.2% and 7.0%) lack statistical significance tests, confidence intervals, or ablation controls on the pretraining components (RCMD and PACM), making it unclear whether gains are robust or dataset-specific.
minor comments (2)
  1. Abstract and methods: The exact five modality settings used in evaluation should be enumerated explicitly for reproducibility.
  2. Notation: The acronyms RCMD and PACM are introduced without sufficient expansion or pseudocode in the main text; adding both would help readers unfamiliar with the distillation and masking strategies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and recognition of the potential clinical significance of our work. We address each major comment below and describe the revisions we will implement.

Point-by-point responses
  1. Referee: Experiments section: The central claim of support for arbitrary modality availability (from lone T1 to full multimodal) is not directly supported by the evidence, as all quantitative results are restricted to five clinically motivated modality settings. No evaluations are reported for unseen combinations such as isolated FLAIR, T2+PET without T1, or PET-only, leaving generalization to arbitrary unseen subsets untested.

    Authors: We agree that the current quantitative results focus on five clinically motivated settings and that testing additional unseen combinations would more strongly support the arbitrary availability claim. In the revised manuscript, we will add evaluations for isolated FLAIR, T2+PET without T1, and PET-only inputs in the Experiments section, using the same downstream tasks to demonstrate generalization. revision: yes

  2. Referee: Results section: The reported outperformance and relative improvements (6.2% and 7.0%) lack statistical significance tests, confidence intervals, or ablation controls on the pretraining components (RCMD and PACM), making it unclear whether gains are robust or dataset-specific.

    Authors: We acknowledge that statistical tests, confidence intervals, and component ablations are needed to establish robustness. We will add paired statistical significance tests and 95% confidence intervals for the reported accuracy improvements. We will also include ablation experiments that isolate the contributions of RCMD and PACM, reporting results across the datasets to confirm the gains are not dataset-specific. revision: yes
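The rebuttal promises paired significance tests and 95% confidence intervals. The sketch below shows one common way to do both when two models are scored on the same test subjects: an exact McNemar test on discordant pairs and a percentile bootstrap over subjects for the accuracy difference. The choice of these two procedures is ours, not something the rebuttal specifies, and the per-subject correctness vectors are hypothetical inputs.

```python
import math
import random


def mcnemar_exact_p(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-subject correctness (0/1)."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)  # only A right
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)  # only B right
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def paired_bootstrap_ci(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B), resampling subjects."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same subjects for both models
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[round(alpha / 2 * n_boot)]
    upper = diffs[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical per-subject outcomes (1 = correct) for two models on one test set.
model_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
model_b = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
print(mcnemar_exact_p(model_a, model_b), paired_bootstrap_ci(model_a, model_b))
```

Running the same comparison separately on each dataset and modality setting would speak directly to the referee's concern that gains may be dataset-specific.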

Circularity Check

0 steps flagged

No circularity: empirical downstream evaluations are independent of pretraining definitions

full rationale

The paper's central claims consist of measured accuracy improvements on four downstream tasks across five modality settings using held-out data from the pretraining corpus. These results are obtained by standard fine-tuning and evaluation protocols rather than by any equation that reduces a reported prediction to a fitted parameter or self-defined quantity inside the same model. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the architecture or loss terms in a load-bearing way; the RCMD and PACM components are presented as design choices whose value is assessed by external task performance. The arbitrary-modality claim is an extrapolation from the tested settings, but this is a generalization question rather than a circular reduction of the reported numbers to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard masked-autoencoder assumptions plus two domain-specific design choices whose justification is not visible in the abstract.

axioms (2)
  • domain assumption Cross-modal correspondences learned via distillation on the training distribution transfer to arbitrary missing-modality test cases.
    Invoked implicitly by the claim that one model works for any modality subset.
  • domain assumption Atlas-guided masking prioritizes disease-vulnerable regions without introducing selection bias on downstream tasks.
    Stated as part of the PACM component.

pith-pipeline@v0.9.0 · 5603 in / 1401 out tokens · 33644 ms · 2026-05-14T20:24:01.313913+00:00 · methodology


Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

  1. Beekly, D.L., Ramos, E.M., Lee, W.W., Deitrich, W.D., Jacka, M.E., Wu, J., Hubbard, J.L., Koepsell, T.D., Morris, J.C., Kukull, W.A.: The national Alzheimer’s coordinating center (NACC) database: The uniform data set. Alzheimer Disease & Associated Disorders 21, 249–258 (2007)

  2. Braak, H., Alafuzoff, I., Arzberger, T., Kretzschmar, H.A., Tredici, K.D.: Staging of Alzheimer disease-associated neurofibrillary pathology using paraffin sections and immunocytochemistry. Acta Neuropathologica 112, 389–404 (2006)

  3. Chen, K., Weng, Y., you Huang, Y., Zhang, Y., Dening, T., et al.: A multi-view learning approach with diffusion model to synthesize FDG PET from MRI T1WI for diagnosis of Alzheimer’s disease. Alzheimer’s & Dementia 21 (2024)

  4. Chételat, G., Arbizu, J., Barthel, H., Garibotto, V., Law, I., Morbelli, S., Van De Giessen, E., Agosta, F., Barkhof, F., Brooks, D.J., et al.: Amyloid-PET and 18F-FDG-PET in the diagnostic investigation of Alzheimer’s disease and other dementias. The Lancet Neurology 19(11), 951–962 (2020)

  5. Deng, Z., Wang, H., Huang, Z., Zhang, L., Aviles-Rivero, A.I., Liu, C., He, J., Kourtzi, Z., Schönlieb, C.B.: Brain foundation models with hypergraph dynamic adapter for brain disease analysis. Pattern Recognition p. 112595 (2025)

  6. Ding, R., Lu, H., Liu, M.: DenseFormer-MoE: A dense transformer foundation model with mixture of experts for multi-task brain image analysis. IEEE Transactions on Medical Imaging 44, 4037–4048 (2025)

  7. Ellis, K.A., Bush, A.I., Darby, D., De Fazio, D., Foster, J., Hudson, P., Lautenschlager, N.T., Lenzo, N., Martins, R.N., Maruff, P., et al.: The Australian imaging, biomarkers and lifestyle (AIBL) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of Alzheimer’s disease. International Psychogeriatrics 21(4), 672–687 (2009)

  8. Erdur, A.C., Beischl, C., Scholz, D., Pan, J., Wiestler, B., Rueckert, D., Peeken, J.C.: MultiMAE for brain MRIs: Robustness to missing inputs using multi-modal masked autoencoder. In: International Workshop on Machine Learning in Medical Imaging. pp. 572–582. Springer (2025)

  9. Han, X., Nguyen, H., Harris, C., Ho, N., Saria, S.: FuseMoE: Mixture-of-experts transformers for fleximodal fusion. Advances in Neural Information Processing Systems 37, 67850–67900 (2025)

  10. Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 6546–6555 (2018)

  11. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15979–15988 (2022)

  12. Hu, W., Guan, Z., Yang, P., Li, J., Liu, Y., Gan, S., Cai, T., Zhang, A., Zhang, T., Qu, J., et al.: Anatomy-guided multimodal graph networks for Alzheimer’s disease: Integrative analysis of cross-modal brain connectivity signatures. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 66–75. Springer (2025)

  13. Jack Jr, C.R., Andrews, J.S., Beach, T.G., Buracchio, T., Dunn, B., Graf, A., Hansson, O., Ho, C., Jagust, W., McDade, E., et al.: Revised criteria for diagnosis and staging of Alzheimer’s disease: Alzheimer’s association workgroup. Alzheimer’s & Dementia 20(8), 5143–5169 (2024)

  14. Jang, J., Hwang, D.: M3T: three-dimensional medical image classifier using multi-plane and multi-slice transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20686–20697 (2022)

  15. Kool, W., Van Hoof, H., Welling, M.: Stochastic beams and where to find them: The gumbel-top-k trick for sampling sequences without replacement. In: International conference on machine learning. vol. 97, pp. 3499–3508. PMLR (2019)

  16. Leng, Y., Cui, W., Peng, Y., Yan, C., Cao, Y., Yan, Z., Chen, S., Jiang, X., Zheng, J., Initiative, A.D.N., et al.: Multimodal cross enhanced fusion network for diagnosis of Alzheimer’s disease and subjective memory complaints. Computers in Biology and Medicine 157, 106788 (2023)

  17. Rui, S., Chen, L., Tang, Z., Wang, L., Liu, M., Zhang, S., Wang, X.: Multi-modal vision pre-training for medical image analysis. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5164–5174 (2025)

  18. Sperling, R.A., Donohue, M.C., Raman, R., Sun, C.K., Yaari, R., Holdridge, K., Siemers, E., et al.: Association of factors with elevated amyloid burden in clinically normal older individuals. JAMA Neurology 77(6), 735–745 (2020)

  19. Tak, D., Garomsa, B.A., Zapaishchykova, A., Chaunzwa, T.L., Climent Pardo, J.C., Ye, Z., Zielke, J., Ravipati, Y., Pai, S., et al.: A generalizable foundation model for analysis of human brain MRI. Nature Neuroscience pp. 1–12 (2026)

  20. Weiner, M.W., Harvey, D., Hayes, J., Landau, S.M., Aisen, P.S., Petersen, R.C., Tosun, D., Veitch, D.P., Jack Jr, C.R., Decarli, C., et al.: Effects of traumatic brain injury and posttraumatic stress disorder on development of Alzheimer’s disease in Vietnam veterans using the Alzheimer’s disease neuroimaging initiative: preliminary report. Alzheimer’s & Dementia: Translational Research & Clinical Interventions 3(2), 177–188 (2017)

  21. Weiner, M.W., Veitch, D.P., Aisen, P.S., Beckett, L.A., Cairns, N.J., Green, R.C., Harvey, D., Jack, C.R., Jagust, W., Liu, E., et al.: The Alzheimer’s disease neuroimaging initiative: a review of papers published since its inception. Alzheimer’s & Dementia 9(5), e111–e194 (2013)

  22. Yang, G., Du, K., Yang, Z., Du, Y., Cheung, E.Y.W., Zheng, Y., Yang, M., Kourtzi, Z., Schönlieb, C.B., Wang, S., Initiative, A.D.N.: ADFound: A foundation model for diagnosis and prognosis of Alzheimer’s disease. IEEE Journal of Biomedical and Health Informatics 29(11), 8395–8408 (2025)

  23. Yin, L., Ye, C., Liu, T., Wu, J., Yan, T.: UniCross: Balanced multimodal learning for Alzheimer’s disease diagnosis by uni-modal separation and metadata-guided cross-modal interaction. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 638...

  24. Yun, S., Choi, I., Peng, J., Wu, Y., Bao, J., Zhang, Q., Xin, J., et al.: Flex-MoE: Modeling arbitrary modality combination via the flexible mixture-of-experts. Advances in Neural Information Processing Systems 37, 98782–98805 (2025)

  25. Yun, S., Xin, J., Choi, I., Peng, J., Ding, Y., Long, Q., Chen, T.: Generate, then retrieve: Addressing missing modalities in multimodal learning via generative AI and MoE. In: Workshop on Large Language Models and Generative AI for Health at AAAI 2025 (2025)

  26. Zhang, J., He, X., Liu, Y., Cai, Q., Chen, H., Qing, L.: Multi-modal cross-attention network for Alzheimer’s disease diagnosis with multi-modality data. Computers in Biology and Medicine 162, 107050 (2023)

  27. Zhang, X., Ou, N., Basaran, B.D., Visentin, M., Qiao, M., Gu, R., Matthews, P.M., Liu, Y., Ye, C., Bai, W.: A foundation model for lesion segmentation on brain MRI with mixture of modality experts. IEEE Transactions on Medical Imaging 44(6), 2594–2604 (2025)