A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning
Pith reviewed 2026-05-19 09:49 UTC · model grok-4.3
The pith
FOMO260K aggregates 260,927 heterogeneous brain MRI scans from 910 sources to support large-scale self-supervised learning in medical imaging.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors have compiled and released FOMO260K: 260,927 3D brain MRI scans from 77,589 sessions and 55,378 subjects, aggregated from 910 publicly available sources. The dataset spans clinical- and research-grade images, multiple sequences, and substantial anatomical and pathological variability, including large brain anomalies. Preprocessing is kept minimal to preserve the original image characteristics, and the release is intended to support self-supervised pretraining and benchmarking at scale.
What carries the argument
The FOMO260K dataset, which aggregates minimally preprocessed scans from diverse public sources to capture wide anatomical, pathological, and acquisition variability for self-supervised model development.
If this is right
- The provided companion code allows direct pretraining of self-supervised models on FOMO260K followed by finetuning on downstream tasks.
- The dataset enables systematic benchmarking of self-supervised methods on data that reflects real-world heterogeneity in MRI acquisition and pathology.
- Pretrained models released alongside the dataset can serve as initial weights for a range of brain MRI analysis applications.
- Models developed this way are expected to handle new scans from varied protocols and populations more reliably than those trained on limited data.
Where Pith is reading between the lines
- Hospitals could adapt models pretrained on this scale of data to local scanner fleets with less need for new labeled examples.
- Similar aggregation strategies might be applied to other imaging modalities to create cross-domain self-supervised resources.
- The emphasis on minimal preprocessing could encourage dataset creators to prioritize accessibility over aggressive standardization in future releases.
Load-bearing premise
Aggregating minimally preprocessed scans from 910 heterogeneous public sources will yield a dataset with enough variability and quality to train generalizable self-supervised models without source-specific biases dominating.
What would settle it
An experiment in which self-supervised models pretrained on FOMO260K show no better generalization to unseen clinical datasets than models pretrained on smaller homogeneous collections would indicate the central claim does not hold.
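One way such a comparison could be scored is sketched below: a paired bootstrap over per-dataset results for two pretraining regimes. The function name, the score lists, and the choice of a percentile bootstrap are all illustrative assumptions, not part of the paper or its companion code.

```python
import random
from statistics import mean

def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-dataset difference (a minus b)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    boot_means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return mean(diffs), lo, hi

# Hypothetical scores on five unseen clinical datasets (made-up numbers).
fomo_scores = [0.81, 0.77, 0.84, 0.79, 0.80]      # pretrained on FOMO260K
baseline_scores = [0.78, 0.76, 0.80, 0.79, 0.77]  # pretrained on a small homogeneous set
delta, lo, hi = paired_bootstrap_ci(fomo_scores, baseline_scores)
print(f"mean gain {delta:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

An interval that comfortably includes zero would be consistent with the "no better generalization" outcome described above. With only a handful of evaluation datasets the interval is coarse, so this is a starting point rather than a verdict.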
Original abstract
We present FOMO260K, a large-scale, heterogeneous dataset of 260,927 brain Magnetic Resonance Imaging (MRI) scans from 77,589 MRI sessions and 55,378 subjects, aggregated from 910 publicly available sources. The dataset includes both clinical- and research-grade images, multiple MRI sequences, and a wide range of anatomical and pathological variability, including scans with large brain anomalies. Minimal preprocessing was applied to preserve the original image characteristics while reducing entry barriers for new users. Companion code for self-supervised pretraining and finetuning is provided, along with pretrained models. FOMO260K is intended to support the development and benchmarking of self-supervised learning methods in medical imaging at scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FOMO260K, a dataset of 260,927 brain MRI scans from 77,589 sessions and 55,378 subjects aggregated from 910 public sources. It includes clinical and research-grade images across multiple sequences and a range of anatomical/pathological variability, with only minimal preprocessing applied to preserve original characteristics. Companion code for self-supervised pretraining/finetuning and pretrained models are provided, with the explicit goal of supporting large-scale SSL development and benchmarking in medical imaging.
Significance. If the dataset's scale and heterogeneity can be shown to reflect clinically relevant variability rather than unmitigated acquisition artifacts, the release would be a substantial contribution to SSL research in medical imaging, where existing public datasets are typically orders of magnitude smaller. The provision of reproducible code and pretrained models is a clear strength that lowers barriers to entry and supports immediate use.
major comments (2)
- [Abstract] Abstract: The central claim that FOMO260K supports 'development and benchmarking of self-supervised learning methods ... at scale' rests on the assumption that inter-source heterogeneity primarily captures anatomical and pathological variability. However, the abstract provides no quantitative source-stratified statistics (e.g., field strength distributions, sequence parameter ranges, or intensity histogram comparisons) to substantiate this, leaving open the possibility that models learn scanner-specific signatures instead.
- [Methods] Methods / Dataset construction: Minimal preprocessing is described as preserving original image characteristics, yet the manuscript reports no harmonization steps, artifact audits, or cross-source quality metrics. This directly impacts the weakest assumption that the aggregated data will yield generalizable SSL features; without such checks, domain shifts from acquisition protocols remain unmitigated and could dominate the training signal.
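The intensity histogram comparison mentioned in the first major comment could be approximated as follows. This is a minimal sketch, assuming voxel intensities have already been rescaled to [0, 1] per scan; the bin count, the helper names, and the use of Jensen-Shannon divergence as the distance are choices made here for illustration and do not come from the manuscript.

```python
import math

def normalized_histogram(values, n_bins=32, lo=0.0, hi=1.0):
    """Histogram of values clipped into [lo, hi], normalised to sum to 1."""
    if not values:
        raise ValueError("need at least one intensity value")
    counts = [0] * n_bins
    for v in values:
        frac = (min(max(v, lo), hi) - lo) / (hi - lo)
        counts[min(int(frac * n_bins), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint histograms."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical rescaled intensities sampled from two sources.
source_a = [0.10, 0.12, 0.15, 0.40, 0.42, 0.45]
source_b = [0.50, 0.55, 0.60, 0.80, 0.85, 0.90]
hist_a = normalized_histogram(source_a)
hist_b = normalized_histogram(source_b)
print(f"JS divergence between sources: {js_divergence(hist_a, hist_b):.3f}")
```

Computed pairwise across sources, a matrix of such divergences would give a rough picture of how far apart the intensity distributions sit. It cannot by itself separate scanner effects from genuine anatomical or pathological differences.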
minor comments (1)
- [Results] The manuscript would benefit from an explicit table summarizing key acquisition metadata (field strength, scanner vendor, sequence type) broken down by source or at least by major cohorts to allow readers to assess heterogeneity.
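A table of the kind requested in the minor comment could be derived from per-scan metadata roughly as shown here. The record layout and the field names (`source`, `field_strength_t`, `vendor`, `sequence`) are hypothetical; the actual metadata schema of FOMO260K is not described in this review.

```python
from collections import Counter, defaultdict

FIELDS = ("field_strength_t", "vendor", "sequence")

def summarize_by_source(records, fields=FIELDS):
    """Count the values of each metadata field separately for every source."""
    summary = defaultdict(lambda: {f: Counter() for f in fields})
    for rec in records:
        for f in fields:
            # Missing metadata is counted explicitly rather than dropped.
            summary[rec["source"]][f][rec.get(f, "unknown")] += 1
    return dict(summary)

# Hypothetical per-scan records; real values would come from source metadata.
records = [
    {"source": "cohort_a", "field_strength_t": 3.0, "vendor": "vendor_x", "sequence": "T1w"},
    {"source": "cohort_a", "field_strength_t": 3.0, "vendor": "vendor_x", "sequence": "FLAIR"},
    {"source": "cohort_b", "field_strength_t": 1.5, "vendor": "vendor_y", "sequence": "T1w"},
    {"source": "cohort_b", "vendor": "vendor_y", "sequence": "T2w"},
]
summary = summarize_by_source(records)
for source, per_field in sorted(summary.items()):
    for name, counts in per_field.items():
        cells = ", ".join(f"{value}: {n}" for value, n in counts.most_common())
        print(f"{source:10s} {name:18s} {cells}")
```

Keeping an explicit "unknown" bucket matters for aggregated public data, where acquisition metadata is often incomplete.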
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript describing the FOMO260K dataset. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.
- Referee: [Abstract] Abstract: The central claim that FOMO260K supports 'development and benchmarking of self-supervised learning methods ... at scale' rests on the assumption that inter-source heterogeneity primarily captures anatomical and pathological variability. However, the abstract provides no quantitative source-stratified statistics (e.g., field strength distributions, sequence parameter ranges, or intensity histogram comparisons) to substantiate this, leaving open the possibility that models learn scanner-specific signatures instead.
  Authors: We agree that including quantitative source-stratified statistics in the abstract would better support the central claim. In the revised version, we will add a sentence summarizing key heterogeneity metrics, including the distribution of field strengths (e.g., 1.5T vs 3T) and sequence types across sources. Regarding the risk of models learning scanner-specific signatures, this is a known challenge in multi-source medical datasets; our intent is to provide data that reflects real clinical acquisition variability, which SSL methods must ultimately handle for practical deployment. We will expand the discussion section to note this limitation and suggest that users evaluate feature invariance.
  Revision: yes
- Referee: [Methods] Methods / Dataset construction: Minimal preprocessing is described as preserving original image characteristics, yet the manuscript reports no harmonization steps, artifact audits, or cross-source quality metrics. This directly impacts the weakest assumption that the aggregated data will yield generalizable SSL features; without such checks, domain shifts from acquisition protocols remain unmitigated and could dominate the training signal.
  Authors: The choice of minimal preprocessing is deliberate to maintain the original characteristics of the images from diverse sources, enabling research into methods that are robust to real-world domain shifts. We did not apply harmonization because doing so would alter the dataset's intended heterogeneity. However, we acknowledge the referee's concern about quality metrics. In the revision, we will include additional details on the basic quality control performed during aggregation, such as checks for image dimensions and basic intensity statistics, and provide cross-source summaries where available from the source metadata. Comprehensive artifact audits across all 910 sources exceed the scope of this dataset release paper, as each source has its own curation process.
  Revision: partial
- Demonstrating that inter-source heterogeneity primarily reflects clinically relevant anatomical and pathological variability rather than unmitigated acquisition artifacts would require additional validation studies beyond the scope of this dataset release.
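The feature-invariance evaluation that the authors propose leaving to users could take the form of a source-leakage probe: check how well the originating source can be predicted from a model's embeddings. The sketch below uses a leave-one-out nearest-centroid classifier purely for illustration; the function name, the toy embeddings, and the choice of probe are assumptions, not something the paper or its code provides.

```python
import math
from collections import defaultdict

def source_leakage_accuracy(features, sources):
    """Leave-one-out nearest-centroid accuracy for predicting the source label."""
    n, dim = len(features), len(features[0])
    correct = 0
    for i in range(n):
        sums = defaultdict(lambda: [0.0] * dim)
        counts = defaultdict(int)
        for j in range(n):
            if j == i:
                continue  # hold out the scan being classified
            counts[sources[j]] += 1
            for k in range(dim):
                sums[sources[j]][k] += features[j][k]
        centroids = {s: [v / counts[s] for v in sums[s]] for s in sums}
        predicted = min(centroids, key=lambda s: math.dist(features[i], centroids[s]))
        correct += predicted == sources[i]
    return correct / n

# Hypothetical 2-D embeddings: one set clusters by source, the other does not.
leaky = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
leaky_sources = ["a", "a", "a", "b", "b", "b"]
mixed = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
mixed_sources = ["a", "b", "a", "b"]
print("leaky embeddings:", source_leakage_accuracy(leaky, leaky_sources))
print("mixed embeddings:", source_leakage_accuracy(mixed, mixed_sources))
```

Accuracy well above chance suggests the embeddings encode the source. That alone does not show the signal is a scanner signature, since sources can also differ in population and pathology, which is the distinction the referee raises.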
Circularity Check
No circularity: direct dataset aggregation with no derivations or predictions
The paper's contribution is the release of FOMO260K, formed by aggregating 260,927 scans from 910 public sources with only minimal preprocessing. No equations, fitted parameters, predictions, or derivation chains appear in the abstract or described content. The central claim is a factual description of data collection and release, not a self-referential or constructed result. The work is therefore self-contained against external benchmarks, with no load-bearing steps that reduce to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Publicly available MRI datasets can be aggregated without violating original usage terms or introducing unresolvable privacy issues.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We present FOMO260K, a large-scale, heterogeneous dataset of 260,927 brain Magnetic Resonance Imaging (MRI) scans... Minimal preprocessing was applied to preserve the original image characteristics"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge
  Self-supervised pretraining on 60K clinical-style brain MRIs improves out-of-domain generalization on classification, segmentation, and regression tasks, with hybrid objectives and small models showing strong results.
Acknowledgements
This work has been supported by the Danish Data Science Academy, which is funded by the Novo Nordisk Foundation (grant number NNF21SA0069429) and Villum Fonden (grant number 40...