
arxiv: 2606.14957 · v2 · pith:LXI5WCQMnew · submitted 2026-06-12 · 💻 cs.CV

Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

Pith reviewed 2026-06-27 04:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords neuroimaging · foundation model · multimodal MRI · latent predictive learning · mixture of experts · brain MRI · sparse representations · T1w T2w FLAIR

The pith

Neuro-JEPA uses a latent predictive objective and Mixture-of-Experts to learn unified representations from T1w, T2w, and FLAIR brain MRIs, delivering consistent gains over prior foundation models and a simple CNN across 47 tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Neuro-JEPA as a new foundation model for multimodal brain MRI that pretrains on 1.55 million scans from three core sequences. It combines a latent predictive training objective with a Mixture-of-Experts architecture to produce sparse, unified representations. Across 25 clinical tasks from three health systems and 22 public-dataset tasks, Neuro-JEPA shows stronger and more consistent performance than existing neuroimaging foundation models, which in turn show inconsistent gains over a plain CNN baseline. A sympathetic reader would care because routine clinical MRIs come in multiple complementary contrasts yet lack scalable methods for joint representation learning at health-system scale.

Core claim

Neuro-JEPA, pretrained on 1,551,862 scans from 428,647 studies after modality-specific preprocessing, achieves stronger and more consistent performance than existing neuroimaging foundation models and a simple convolutional neural network baseline across unimodal, multimodal, and cross-domain evaluation configurations on 25 tasks from NYU Langone, NYU Long Island, and Massachusetts General Hospital plus 22 tasks from 12 public datasets.

What carries the argument

Neuro-JEPA: a sparse multimodal neuroimaging foundation model that combines a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across T1w, T2w, and FLAIR sequences.
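
To make the two ingredients concrete, here is a minimal sketch of sparse Mixture-of-Experts routing paired with an L1 loss in latent space. It is not the authors' implementation: the top-k gating rule, the value k = 2, the toy scaling experts, and every function name below are assumptions chosen for illustration. The source states only that the model combines a latent predictive objective with a Mixture-of-Experts architecture, and the figure captions mention an L1 latent predictive loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Select the k highest-scoring experts and renormalize their gate weights.

    Top-k gating is an assumed routing rule, used here only for illustration.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(token, experts, router_logits, k=2):
    """Sparse forward pass: only the selected experts are evaluated."""
    gates = route_top_k(router_logits, k)
    out = [0.0] * len(token)
    for idx, weight in gates.items():
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, gates


def l1_latent_loss(predicted, target):
    """Mean absolute error between predicted and target latent vectors."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def make_scaling_expert(scale):
    """Hypothetical toy expert that multiplies every feature by a constant."""
    return lambda token: [scale * v for v in token]


# Toy usage: four experts, one two-dimensional token.
experts = [make_scaling_expert(s) for s in (0.5, 1.0, 2.0, 4.0)]
out, gates = moe_forward([1.0, -1.0], experts, [0.1, 2.0, 1.9, -3.0], k=2)
loss = l1_latent_loss(out, [1.0, -1.0])
```

The sparsity comes from evaluating only k experts per token; the remaining experts contribute no compute for that token.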

If this is right

  • A scalable methodological framework exists for multimodal neuroimaging representation learning.
  • Foundation model evaluations in neuroimaging should routinely include simple CNN baselines, clinically heterogeneous cohorts, and controlled multimodal comparisons.
  • The latent predictive plus Mixture-of-Experts design supports robust performance in unimodal, multimodal, and cross-domain settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sparse predictive approach might transfer to other multi-sequence medical imaging domains such as CT or PET.
  • Requiring simple baselines in every evaluation could reduce over-claiming of gains in medical foundation model papers.
  • Pretraining directly on multi-site health-system data may reduce the usual domain-shift problems when models move into new hospitals.

Load-bearing premise

The chosen 25 tasks from three health systems plus 22 public tasks provide a representative test of multimodal performance without hidden selection effects or site-specific biases.

What would settle it

Neuro-JEPA would lose its claim of consistent superiority if it failed to outperform the CNN baseline on a fresh collection of tasks drawn from additional health systems that use different scanner vendors or acquisition protocols.
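
One way such a test could be scored is a per-task paired comparison against the CNN baseline, summarized by an exact one-sided sign test. The sketch below is an illustration only: the task names and scores are placeholders invented for the example, not results from the paper, and the sign test is an assumed choice of statistic rather than the authors' protocol.

```python
from math import comb


def paired_outcomes(model_scores, baseline_scores):
    """Count per-task wins, losses, and ties of the model against the baseline."""
    wins = losses = ties = 0
    for task, model_score in model_scores.items():
        base = baseline_scores[task]
        if model_score > base:
            wins += 1
        elif model_score < base:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties


def sign_test_p_value(wins, losses):
    """Exact one-sided sign test: P(X >= wins) for X ~ Binomial(wins + losses, 0.5).

    Ties are dropped before the test, which is the usual convention.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    return sum(comb(n, i) for i in range(wins, n + 1)) / 2 ** n


# Placeholder AUROC values for hypothetical tasks (NOT taken from the paper).
model = {"task_a": 0.81, "task_b": 0.77, "task_c": 0.90, "task_d": 0.64,
         "task_e": 0.70, "task_f": 0.88, "task_g": 0.73}
cnn = {"task_a": 0.78, "task_b": 0.75, "task_c": 0.86, "task_d": 0.66,
       "task_e": 0.70, "task_f": 0.80, "task_g": 0.71}

wins, losses, ties = paired_outcomes(model, cnn)
p_value = sign_test_p_value(wins, losses)
```

A sign test ignores the size of each gain, so it speaks to consistency across tasks rather than to average improvement.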

Figures

Figures reproduced from arXiv: 2606.14957 by Arjun Masurkar, Daniel Orringer, Haoxu Huang, James Ryan Loftus, Jennifer Frontera, Jingyun Chen, Jinu Hyun, Kara Melmed, Long Chen, Narges Razavian, Seena Dehkharghani.

Figure 1
Figure 1: Overview of the study - the full pipeline on pre-training Neuro-JEPA with data distribution and performance evaluations. a, MRI modalities distribution on T1w, T2w and FLAIR. b, Disease distribution on pre-training data from five main categories. c, Number of patients for each task and modality on evaluated downstream datasets. d, Neuroimaging specialized pre-training architecture built upon JEPA with impr…
Figure 2
Figure 2: Best Achievable Unimodal Performance - AUROC for diagnosis/prognosis and C-index for time-to-event tasks are reported with performance from best modality across four foundation models (NeuroVFM, BrainIAC, VoCo and Neuro-JEPA) evaluated under full finetuning except NeuroVFM. The result shows that our model achieves overall best performance across the tasks. a, per task performance on best performance modali…
Figure 3
Figure 3: Ablation of Model Design and Scaling - AUROC and AUPRC are averaged across all tasks and modalities from three health systems with attentive probing: NYU Langone, NYU Long Island, and MGH. a,b, Stepwise ablation of model design [30], in which each modification is introduced sequentially from the original V-JEPA2 implementation to the final FM-NeuroSp model, showing each design choice brings meaningful cont…
Figure 4
Figure 4: Few-shot Analysis - we examine the evaluated models' label efficiency when only k = {16,32,64,128,256} positive samples are provided with full fine-tuning except NeuroVFM. The performance is reported in AUROC for classification and MAE for regression. The result demonstrates that our model performs better than the evaluated models by a large margin in the majority of settings with limited labeled data. a-d, F…
Figure 5
Figure 1
Figure 1: Pretrain Data Patients Statistics - a, Number of patients, studies and total scans used in pretrain data. b, Geographic distribution of patients in pretrain data.
Figure 2
Figure 2: Multiscale Masking Sample. Foreground is denoted as FG, background is denoted as BG and the foreground ratio for encoder/predictor are separately shown.
Figure 3
Figure 3: Multiscale Masking Sample. Foreground is denoted as FG, background is denoted as BG and the foreground ratio for encoder/predictor are separately shown.
Figure 4
Figure 4: Multiscale Masking Sample. Foreground is denoted as FG, background is denoted as BG and the foreground ratio for encoder/predictor are separately shown.
Figure 5
Figure 5: Per Task AUPRC Across Public Dataset Tasks and AUPRC for Best Achievable Unimodal Performance. AUPRC for public datasets on each unimodal performance and best achievable AUPRC performance across datasets and tasks. The result shows that our model demonstrates consistent performance improvement in comparison to other foundation models on AUPRC. a, AUPRC performance on different tasks with different modalities…
Figure 6
Figure 6: Unimodal Per Task AUROC and AUPRC Across Three Health System Datasets. AUROC and AUPRC with per modality performance for each task on NYU Langone, NYU Long Island and BIND-MGH datasets across all evaluated foundation models. All tasks are evaluated by full fine-tuning. Our model shows improved performance across the majority of tasks on different modalities. a, AUROC for NYU Langone dataset. b, AUROC for NYU Long…
Figure 7
Figure 7: Kaplan Meier Curve for Time-to-Event on All Modalities. Result is reported with Concordance Index (C-index) Prodromal to PD conversion for PPMI dataset and Overall Survival for UCSF-PDGM dataset. a, MCI to AD conversion within 3 years for ADNI dataset with T1w. b,c, Prodromal to PD conversion within 3 years for PPMI dataset with T1w and FLAIR. e-g, Overall Survival for UCSF-PDGM dataset with T1w, T2w and F…
Figure 8
Figure 8: Age Prediction Comparison on OpenBHB Dataset. Age prediction as regression on OpenBHB dataset on Quasi-Raw T1w scans with performance reported on R², Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). The result shows that our model outperforms existing foundation models, especially with stronger fitting on patients with older age. a, regression goodness of fit for each individual foundation m…
Figure 9
Figure 9: Age Distribution on OpenBHB Dataset. We show age distribution on train, validation and test set on OpenBHB dataset. As it demonstrates, the dataset presents a heavy long-tailed distribution toward older ages, where the evaluated models not trained on large-scale clinical datasets fail to generalize.
Figure 10
Figure 10: Multimodal Learning Performance Across Fusion Methods for BrainIAC, VoCo and Neuro-JEPA on Public Datasets - AUROC.
Figure 11
Figure 11: Multimodal Learning Performance Across Fusion Methods for BrainIAC, VoCo and Neuro-JEPA on Public Datasets - AP.
Figure 12
Figure 12: Multimodal Learning Performance Across Fusion Methods for NeuroVFM on Public Datasets - AUROC and AP. AUROC and AUPRC for unimodal and multimodal performance for NeuroVFM. The best suggested method (MIL) from original paper for multimodal fusion is applied in this evaluation.
Figure 13
Figure 13: Multimodal Performance and Gain Over Unimodal on AUPRC. We report AUPRC on multimodal performance when two different modalities are combined and multimodal performance gain over unimodal defined as the difference between best multimodal combination and best unimodal performance. a, Multimodal performance on AUPRC across selected tasks on public datasets for all four compared foundation models. The result …
Figure 14
Figure 14: Multimodal Gain Over Unimodal for BrainIAC and VoCo. The difference between best multimodal fusion vs. best unimodal performance on AUROC and AUPRC. a,b, AUROC and AUPRC multimodal performance gain on the difference for BrainIAC. c,d, AUROC and AUPRC multimodal performance gain on the difference for VoCo.
Figure 15
Figure 15: Multimodal Learning Performance Across Fusion Methods for Neuro-JEPA on BIND-MGH - AUROC.
Figure 16
Figure 16: Multimodal Learning Performance Across Fusion Methods for Neuro-JEPA on BIND-MGH - AP.
Figure 17
Figure 17: Multimodal Learning Performance Across Fusion Methods for BrainIAC on BIND-MGH - AUROC.
Figure 18
Figure 18: Multimodal Learning Performance Across Fusion Methods for BrainIAC on BIND-MGH - AP.
Figure 19
Figure 19: Multimodal Learning Performance Across Fusion Methods for VoCo on BIND-MGH - AUROC.
Figure 20
Figure 20: Multimodal Learning Performance Across Fusion Methods for VoCo on BIND-MGH - AP.
Figure 21
Figure 21: Multimodal Learning Performance Across Fusion Methods for NeuroVFM on BIND-MGH - AUROC.
Figure 22
Figure 22: Multimodal Learning Performance Across Fusion Methods for NeuroVFM on BIND-MGH - AP.
Figure 23
Figure 23: Multimodal Performance on BIND-MGH. We report AUROC and AUPRC for multimodal performance on all models when two different modalities are combined. The result is reported by best performance multimodal fusion method among five different methods. Dotted horizontal lines present average performance across tasks, where the result shows our model outperforms other foundation models by a large margin. a, AUROC…
Figure 24
Figure 24: Multimodal Gain Over Unimodal for Neuro-JEPA on BIND-MGH. The difference between best multimodal fusion vs. best unimodal performance on Neuro-JEPA reported with AUROC and AUPRC. a, AUROC and AUPRC multimodal performance gain on the difference. b, AUPRC multimodal performance gain on the difference.
Figure 25
Figure 25: Multimodal Gain Over Unimodal for NeuroVFM on BIND-MGH. The difference between best multimodal fusion vs. best unimodal performance on NeuroVFM reported with AUROC and AUPRC. a, AUROC and AUPRC multimodal performance gain on the difference. b, AUPRC multimodal performance gain on the difference.
Figure 26
Figure 26: Multimodal Gain Over Unimodal for BrainIAC on BIND-MGH. The difference between best multimodal fusion vs. best unimodal performance on BrainIAC reported with AUROC and AUPRC. a, AUROC and AUPRC multimodal performance gain on the difference. b, AUPRC multimodal performance gain on the difference.
Figure 27
Figure 27: Multimodal Gain Over Unimodal for VoCo on BIND-MGH. The difference between best multimodal fusion vs. best unimodal performance on VoCo reported with AUROC and AUPRC. a, AUROC and AUPRC multimodal performance gain on the difference. b, AUPRC multimodal performance gain on the difference.
Figure 28
Figure 28: Per Dataset Result for Number of Expert Ablation Study. Per dataset result for number of experts with attentive probing across three evaluated datasets (NYU Langone, NYU Long Island and BIND-MGH). Consistent with average performance shown in main manuscript [ref], we observed improved performance on both AUROC and AUPRC when comparing model performance on dense model vs. model with same setting on 16 total expe…
Figure 29
Figure 29: Per Dataset Result for Design Choices Ablation Study. Per dataset result for design choices with attentive probing across three evaluated datasets (NYU Langone, NYU Long Island and BIND-MGH). The color indicates different combinations of methods used during pretraining. Consistent with average performance shown in main manuscript [ref], incrementally adding multiscale masking, mixture of experts and foreground a…
Figure 30
Figure 30: Per Dataset Comparison with NeuroVFM. Per dataset performance comparison with NeuroVFM. The performance is reported as AUROC and AUPRC average across all tasks on each dataset. The result shows that although both models have similar performance on AUROC, our model shows large performance improvement on AUPRC. This is critical in indicating the superiority of our model, as AUPRC better reflects the correctness …
Figure 31
Figure 31: Model Performance on Different Percentage of Pretrain Data. AUROC and AUPRC averaged across tasks for each dataset and modality. a-c, AUROC for different datasets on different percentage for T1w, T2w and FLAIR. e-f, AUPRC for different datasets on different percentage for T1w, T2w and FLAIR.
Figure 32
Figure 32: Model Performance on Pretrain with Uncurated vs. Curated Data. These ablation experiments are run on 30% of pretrain data with Multi-Scale Masking, MoE and Foreground-Aware Masking all enabled. AUROC and AUPRC are averaged across tasks for each dataset and modality. a-c, AUROC on different datasets for model pretrained with uncurated vs. curated data for T1w, T2w and FLAIR. e-f, AUPRC on different datasets …
Figure 33
Figure 33: Samples of Filtered Out Noisy Scans. We show representative slices from scans excluded during scan-level quality control. Many of these scans contained limited usable anatomical information after registration, primarily due to restricted or incorrect fields of view, severe motion artifacts, acquisition or reconstruction failures, and other technical issues. Empirically, we found that pretraining with such…
Figure 34
Figure 34: MAE vs. JEPA Performance Across Datasets. We compare the performance trained on 30% of pretrain data, where we show JEPA on our full configurations with Multi-Scale Masking, MoE and Foreground Aware L1 Loss consistently outperforms MAE. a,b, AUROC and AUPRC performance comparison averaged across all datasets and modalities. c-e, AUROC performance comparison on each dataset and modality. f-g, AUPRC performan…
Figure 35
Figure 35: Cross-cohort out-of-domain transfer performance. AUROC and AUPRC are reported for models fine-tuned on one cohort and evaluated on an external cohort with matched task definitions. The evaluated transfer settings include NACC to ADNI for Alzheimer’s disease and amyloid prediction, and MGH to NYU for hematoma prediction. Transfer performance is compared with in-domain performance, where models are trained …
Figure 36
Figure 36: Performance comparison on DWI downstream tasks. Metrics are macro-averaged AUROC and AUPRC. For the three-class lesion type task, both metrics are computed using a one-versus-rest macro average. Bold indicates the best-performing model for each metric, and underlining indicates the second-best model.
Figure 37
Figure 37: Training Dynamics on Full-data Annealing Pretraining. We show the optimization trajectory of Neuro-JEPA trained for 200 epochs annealing on the full pretraining dataset with our base model, using multiscale masking, Mixture-of-Experts (MoE) routing, and foreground-aware L1 latent predictive loss. a, Latent predictive L1 loss over training steps. b, Minimum MoE load-balancing violation, used to monitor und…
Figure 38
Figure 38: Training Dynamics on Full-data Cooldown Pretraining. We show the optimization trajectory of Neuro-JEPA trained for 40 epochs cooldown on the full pretraining dataset with our base model, using multiscale masking, Mixture-of-Experts (MoE) routing, and foreground-aware L1 latent predictive loss. a, Latent predictive L1 loss over training steps. b, Minimum MoE load-balancing violation, used to monitor under-…
Figure 39
Figure 39: All Evaluated Models vs. Simple CNN Baseline. AUROC and AUPRC for 41 different combinations on public datasets. a, AUROC comparison across tasks and modalities. b, AUPRC comparison across tasks and modalities.
Figure 40
Figure 40: Few-shot Analysis - we examine the evaluated models' label efficiency when only k = {16,32,64,128,256} positive samples are provided on more diverse selected tasks. The performance is reported in AUROC for classification and MAE for regression. a-f, Few-shot performance on selected tasks from public datasets. All results are reported as averaged performance across all available modalities for each task. g-l, …
Figure 41
Figure 41: Few-shot Analysis - we examine the evaluated models' label efficiency when only k = {16,32,64,128,256} positive samples are provided on more diverse selected tasks. The performance is reported in AUPRC for classification. a-e, Few-shot performance on selected tasks from public datasets. All results are reported as averaged performance across all available modalities for each task. f-k, Few-shot performance on…
Figure 42
Figure 42: AUROC Few-shot Analysis on T1w - few-shot performance across tasks for T1w.
Figure 43
Figure 43: AUROC Few-shot Analysis on T2w - few-shot performance across tasks for T2w.
Figure 44
Figure 44: AUROC Few-shot Analysis on FLAIR - few-shot performance across tasks for FLAIR.
Figure 45
Figure 45: AUPRC Few-shot Analysis on T1w - few-shot performance across tasks for T1w.
Figure 46
Figure 46: AUPRC Few-shot Analysis on T2w - few-shot performance across tasks for T2w.
Figure 47
Figure 47: AUPRC Few-shot Analysis on FLAIR - few-shot performance across tasks for FLAIR.
Figure 48
Figure 48: Fairness Analysis across sub-cohorts. Fairness comparison on representative diseases (Cancer, Hydrocephalus, Edema, Dementia) from NYU-Langone dataset across different sub-cohorts by attentive probing. a, AUROC on different sub-cohorts and diseases. b, Fairness gap on maximum AUROC minus minimum AUROC for each sub-cohort and disease. H.4 TSNE Visualization on Age Groups We evaluated the separability of ag…
Figure 49
Figure 49: Age TSNE Across Different Pretrained Models. The plot presents TSNE visualization on different age subgroups (Young (0-24), Adult (35-44) and Senior (65+)) and modalities (T1w, T2w, T2-FLAIR) for NYU Langone Dataset. The silhouette scores and visualization indicate that Neuro-JEPA shows best separation on age subgroups.
Figure 50
Figure 50: Supplementary MoE Routing FG vs BG – T1w NYU. [Plot residue omitted: per-expert routing frequency (%) for experts E1 to E16, with panels for Layers 1, 3, 5, 7, …]
Figure 51
Figure 51: Supplementary MoE Routing FG vs BG – T2w NYU. [Plot residue omitted: per-expert routing frequency (%) for experts E1 to E16, with panels for Layers 1, 3, 5, 7, …]
Figure 52
Figure 52: Supplementary MoE Routing FG vs BG – FLAIR NYU
Figure 53
Figure 53: Supplementary MoE Routing FG vs BG – T1w MGH. [Plot residue omitted: per-expert routing frequency (%) for experts E1 to E16, with panels for Layers 1, 3, 5, 7, …]
Figure 54
Figure 54: Supplementary MoE Routing FG vs BG – T2w MGH. [Plot residue omitted: per-expert routing frequency (%) for experts E1 to E16, with panels for Layers 1, 3, 5, 7, …]
Figure 55
Figure 55: Supplementary MoE Routing FG vs BG – FLAIR MGH
Figure 56
Figure 56: Supplementary MoE Routing on Different Modalities – NYU. [Plot residue omitted: per-expert routing frequency (%) for experts E1 to E16, with panels for Layers 1, 3, 5, 7, …]
Figure 57
Figure 57: Supplementary MoE Routing on Different Modalities – MGH
Figure 58
Figure 58: Supplementary MoE Routing Heatmaps – NYU T1w. [Plot residue omitted: per-layer heatmaps of expert (E0 to E15) against token index (0 to 575), annotated with per-expert routing percentages, …]
Figure 59
Figure 59: Supplementary MoE Routing Heatmaps – NYU T2w
Figure 60
Figure 60: Supplementary MoE Routing Heatmaps – NYU FLAIR. [Plot residue omitted: per-layer heatmaps of expert (E0 to E15) against token index (0 to 575), annotated with per-expert routing percentages, …]
Figure 61
Figure 61: Supplementary MoE Routing Heatmaps – MGH T1w
Figure 62
Figure 62: Supplementary MoE Routing Heatmaps – MGH T2w. [Plot residue omitted: per-layer heatmaps of expert (E0 to E15) against token index (0 to 575), annotated with per-expert routing percentages, …]
Figure 63
Figure 63: Supplementary MoE Routing Heatmaps – MGH FLAIR
Figure 64
Figure 64: MoE Visualization on T1w template.
Figure 65
Figure 65: MoE Visualization on T2w template.
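
Two derived quantities are defined in the captions above: the multimodal gain (best multimodal combination minus best unimodal performance) and the fairness gap (maximum AUROC minus minimum AUROC across sub-cohorts). A minimal sketch of both follows, assuming scores are held in plain dictionaries. The function names, keys, and numbers are hypothetical placeholders, not values from the paper.

```python
def multimodal_gain(unimodal_scores, multimodal_scores):
    """Best multimodal score minus best unimodal score for one task."""
    return max(multimodal_scores.values()) - max(unimodal_scores.values())


def fairness_gap(auroc_by_subcohort):
    """Maximum AUROC minus minimum AUROC across sub-cohorts."""
    values = list(auroc_by_subcohort.values())
    return max(values) - min(values)


# Placeholder scores for a single hypothetical task (NOT from the paper).
unimodal = {"T1w": 0.5, "T2w": 0.75, "FLAIR": 0.625}
multimodal = {"T1w+T2w": 0.875, "T1w+FLAIR": 0.75}
gain = multimodal_gain(unimodal, multimodal)

# Placeholder per-sub-cohort AUROC values (NOT from the paper).
gap = fairness_gap({"group_1": 0.75, "group_2": 0.5, "group_3": 0.625})
```

A positive gain means the best fusion beats the best single sequence; a smaller gap means more uniform performance across sub-cohorts.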
Original abstract

Brain MRIs are routinely acquired as multiple complementary sequences with unique contrast weighting, including T1-weighted imaging (T1w) anatomic and fluid-sensitive T2-weighted (T2w) contrasts. However, methods for learning unified representations across the multitude of MRI contrast mechanisms at health-system scale are lacking. In this study, we introduce Neuro-JEPA, a sparse multimodal neuroimaging foundation model that combines a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across core T1w, T2w, and fluid-suppressed FLAIR imaging (FLAIR). We further provide a systematic methodological study of architectural, masking, objective, and sparsity design choices beneficial for robust neuroimaging multimodal representation learning. Neuro-JEPA was pretrained on 1,551,862 scans from 428,647 studies after modality-specific preprocessing with data curation across three core structural brain MRI sequences. We evaluated the learned representations across clinical and research settings, including 25 tasks from three health systems: NYU Langone, NYU Long Island, and Massachusetts General Hospital, and 22 tasks from 12 public datasets, covering unimodal, multimodal and cross-domain evaluation configurations. Across these benchmarks, existing neuroimaging foundation models showed inconsistent gains over a simple convolutional neural network (CNN) baseline, whereas Neuro-JEPA achieved stronger and more consistent performance across all evaluated settings. These results establish a scalable methodological framework for multimodal neuroimaging representation learning and highlight the need for foundation model evaluation protocols that include simple baselines, clinically heterogeneous cohorts and controlled multimodal comparisons.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Neuro-JEPA, a sparse multimodal neuroimaging foundation model combining a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across T1w, T2w, and FLAIR sequences. Pretrained on 1,551,862 scans from 428,647 studies, the model is evaluated on 25 tasks from three health systems (NYU Langone, NYU Long Island, MGH) and 22 tasks from 12 public datasets. The central claim is that Neuro-JEPA delivers stronger and more consistent performance across unimodal, multimodal, and cross-domain settings than existing neuroimaging foundation models, which showed inconsistent gains over a simple CNN baseline. The work also reports a systematic study of architectural, masking, objective, and sparsity choices.

Significance. If the performance claims hold with appropriate controls, this would represent a meaningful advance in scalable multimodal representation learning for neuroimaging by addressing the lack of unified methods across MRI contrasts at health-system scale. The large pretraining corpus and explicit inclusion of a CNN baseline are notable strengths that could help establish more reliable evaluation practices.

major comments (2)
  1. [Abstract] The assertion that Neuro-JEPA 'achieved stronger and more consistent performance across all evaluated settings' is presented without any quantitative metrics, error bars, statistical tests, result tables, or ablation summaries. This absence is load-bearing for the central empirical claim, as it prevents verification of the reported gains relative to the CNN baseline or prior models.
  2. [Benchmark suite description] For the benchmark suite (25 tasks from three health systems plus 22 public tasks), no information is supplied on a priori task registration, exclusion criteria, potential subject overlap with the pretraining data, or statistical adjustment for scanner/protocol heterogeneity across sites. These details are required to isolate the consistency claim from selection bias or site effects.
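
The error bars requested in the first comment could be produced with a paired percentile bootstrap on the AUROC difference between two models scored on the same test set. The sketch below is one possible approach, not a description of the paper's analysis. The pairwise AUROC computation is quadratic in sample size and suits only small examples, and the resample count, seed, and function names are assumptions.

```python
import random


def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    total = 0.0
    for p in pos:
        for n in neg:
            total += 1.0 if p > n else 0.5 if p == n else 0.0
    return total / (len(pos) * len(neg))


def bootstrap_auroc_diff(labels, scores_a, scores_b, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and percentile interval for AUROC(a) - AUROC(b).

    Both models are resampled on the same indices so the comparison stays paired.
    Resamples containing a single class are skipped.
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue
        a = auroc(y, [scores_a[i] for i in idx])
        b = auroc(y, [scores_b[i] for i in idx])
        diffs.append(a - b)
    if not diffs:
        raise ValueError("no valid bootstrap resamples")
    diffs.sort()
    lo = diffs[int((alpha / 2) * len(diffs))]
    hi = diffs[min(len(diffs) - 1, int((1 - alpha / 2) * len(diffs)))]
    point = auroc(labels, scores_a) - auroc(labels, scores_b)
    return point, lo, hi


# Toy check on four samples (placeholder data, not from the paper).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)
```

An interval that excludes zero on a given task would support a per-task gain; reporting such intervals for all 47 tasks would let readers judge the consistency claim directly.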

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The assertion that Neuro-JEPA 'achieved stronger and more consistent performance across all evaluated settings' is presented without any quantitative metrics, error bars, statistical tests, result tables, or ablation summaries. This absence is load-bearing for the central empirical claim, as it prevents verification of the reported gains relative to the CNN baseline or prior models.

    Authors: The abstract is a concise summary; full quantitative results with metrics, error bars, statistical tests, tables vs. CNN baseline and prior models, and ablations appear in Section 4 and supplements. We will revise the abstract to include key quantitative highlights (e.g., average gains and consistency metrics) while respecting length limits. revision: yes

  2. Referee: [Benchmark suite description] Benchmark suite description (25 tasks from three health systems plus 22 public tasks): No information is supplied on a priori task registration, exclusion criteria, potential subject overlap with the pretraining data, or statistical adjustment for scanner/protocol heterogeneity across sites. These details are required to isolate the consistency claim from selection bias or site effects.

    Authors: We agree these details strengthen the consistency claims. The Methods section describes the tasks but omits the explicit selection process. We will add a subsection on benchmark construction covering a priori criteria, exclusion rules, overlap verification (public datasets are disjoint; clinical tasks use held-out studies), and site-heterogeneity handling via per-site normalization and mixed-effects models. revision: yes
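
Two of the checks promised in the second response can be sketched briefly. This is a minimal sketch under stated assumptions, not the authors' pipeline: `overlapping_subjects` and `per_site_zscore` are hypothetical helper names, subject identifiers are assumed to be comparable across the pretraining and evaluation sets, and the normalized quantity is assumed to be a scalar feature per scan. The mixed-effects modelling mentioned in the response is not shown.

```python
from collections import defaultdict
import statistics


def overlapping_subjects(pretrain_ids, eval_ids):
    """Return subject IDs present in both pretraining and evaluation sets.

    An empty result is the condition the rebuttal describes as
    'disjoint' or 'held-out'.
    """
    return set(pretrain_ids) & set(eval_ids)


def per_site_zscore(values, sites):
    """Z-score each value against the mean and std of its own site.

    values: one scalar per scan; sites: matching site labels.
    Sites with zero variance are left centred but unscaled.
    """
    if len(values) != len(sites):
        raise ValueError("values and sites must have the same length")

    by_site = defaultdict(list)
    for value, site in zip(values, sites):
        by_site[site].append(value)

    site_stats = {}
    for site, site_values in by_site.items():
        mean = statistics.fmean(site_values)
        std = statistics.pstdev(site_values)
        # Guard against division by zero for constant-valued sites.
        site_stats[site] = (mean, std if std > 0 else 1.0)

    return [(v - site_stats[s][0]) / site_stats[s][1]
            for v, s in zip(values, sites)]
```

Per-site normalization removes site-level offsets in the normalized quantity only. If diagnosis rates differ by site, it can also remove genuine signal, which is one reason the response pairs it with mixed-effects models.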

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks without self-referential fitting or derivation


The paper introduces Neuro-JEPA via pretraining on 1.55M scans followed by evaluation on 47 tasks across health systems and public datasets. No equations, latent predictive objectives, or architectural choices are shown to reduce by construction to fitted inputs or self-citations; performance claims are presented as direct empirical outcomes against a CNN baseline and other models. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based solely on the abstract, the central claim rests on the unstated premise that the chosen architectural and objective design choices generalize across the reported clinical and public datasets; no explicit free parameters, mathematical axioms, or new physical entities are described.

invented entities (1)
  • Neuro-JEPA (no independent evidence)
    purpose: sparse multimodal neuroimaging foundation model combining latent predictive objective with Mixture-of-Experts
    New model name and architecture introduced as the primary contribution.

pith-pipeline@v0.9.1-grok · 5842 in / 1372 out tokens · 50548 ms · 2026-06-27T04:34:18.144112+00:00 · methodology

