
arXiv: 2506.14432 · v3 · submitted 2025-06-17 · 📡 eess.IV · cs.CV

A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning

Pith reviewed 2026-05-19 09:49 UTC · model grok-4.3

classification 📡 eess.IV cs.CV
keywords brain MRI · self-supervised learning · medical imaging dataset · heterogeneous data · large-scale dataset · FOMO260K · 3D MRI

The pith

FOMO260K aggregates 260,927 heterogeneous brain MRI scans from 910 sources to support large-scale self-supervised learning in medical imaging.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FOMO260K, a massive collection of brain MRI scans drawn from many different public datasets to enable self-supervised learning at scale. It combines clinical and research images with various sequences and includes cases with significant brain anomalies. By applying only minimal preprocessing, the dataset aims to lower barriers for developing and testing self-supervised learning models tailored to medical images. A sympathetic reader would care because smaller or more uniform datasets have limited the ability to train models that generalize across real clinical variations in scanners and patient groups.

Core claim

The authors have compiled and released FOMO260K, a collection of 260,927 3D brain MRI scans from 77,589 sessions and 55,378 subjects, aggregated from 910 publicly available sources. It features both clinical- and research-grade images, multiple sequences, and substantial anatomical and pathological variability, including large brain anomalies. Preprocessing is kept minimal to preserve the original image characteristics and to support self-supervised pretraining and benchmarking at scale.
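
To put the headline counts in proportion, the following sketch derives a few average ratios from the four numbers stated in the abstract. These are means only and say nothing about how scans are distributed across sessions, subjects, or sources; the variable and key names are illustrative, not taken from the paper.

```python
# Headline counts as stated in the paper's abstract.
SCANS = 260_927
SESSIONS = 77_589
SUBJECTS = 55_378
SOURCES = 910


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator rounded to two decimals."""
    return round(numerator / denominator, 2)


# Average ratios only; the real per-source distribution may be highly skewed.
scale = {
    "scans_per_session": ratio(SCANS, SESSIONS),
    "scans_per_subject": ratio(SCANS, SUBJECTS),
    "sessions_per_subject": ratio(SESSIONS, SUBJECTS),
    "scans_per_source": ratio(SCANS, SOURCES),
}

for name, value in scale.items():
    print(f"{name}: {value}")
```

On average this works out to roughly 3.4 scans per session, 4.7 scans per subject, 1.4 sessions per subject, and about 287 scans per source.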

What carries the argument

The FOMO260K dataset, which aggregates minimally preprocessed scans from diverse public sources to capture wide anatomical, pathological, and acquisition variability for self-supervised model development.

If this is right

  • The provided companion code allows direct pretraining of self-supervised models on FOMO260K followed by finetuning on downstream tasks.
  • The dataset enables systematic benchmarking of self-supervised methods on data that reflects real-world heterogeneity in MRI acquisition and pathology.
  • Pretrained models released alongside the dataset can serve as initial weights for a range of brain MRI analysis applications.
  • Models developed this way are expected to handle new scans from varied protocols and populations more reliably than those trained on limited data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hospitals could adapt models pretrained on this scale of data to local scanner fleets with less need for new labeled examples.
  • Similar aggregation strategies might be applied to other imaging modalities to create cross-domain self-supervised resources.
  • The emphasis on minimal preprocessing could encourage dataset creators to prioritize accessibility over aggressive standardization in future releases.

Load-bearing premise

Aggregating minimally preprocessed scans from 910 heterogeneous public sources will yield a dataset with enough variability and quality to train generalizable self-supervised models without source-specific biases dominating.
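
One way to probe this premise is to ask how easily the source of a scan can be predicted from learned features: if a trivial classifier recovers the source almost perfectly, source signatures are probably prominent in the representation. The sketch below illustrates the idea with a nearest-centroid probe on synthetic toy features. It is an assumption-laden illustration, not a procedure from the paper; the function names, the two-source setup, and the Gaussian features are all hypothetical.

```python
import random
from collections import defaultdict


def centroid(rows):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def source_probe_accuracy(train, test):
    """Fit per-source centroids on `train`, then score source prediction on `test`.

    Both arguments are lists of (feature_vector, source_label) pairs.
    """
    by_source = defaultdict(list)
    for feats, src in train:
        by_source[src].append(feats)
    centroids = {src: centroid(rows) for src, rows in by_source.items()}
    hits = sum(
        min(centroids, key=lambda s: sq_dist(feats, centroids[s])) == src
        for feats, src in test
    )
    return hits / len(test)


def synthetic(n_per_source, offset, rng):
    """Toy features: source 'B' is shifted by `offset` in every dimension."""
    data = []
    for src, shift in (("A", 0.0), ("B", offset)):
        for _ in range(n_per_source):
            data.append(([rng.gauss(shift, 1.0) for _ in range(4)], src))
    return data


rng = random.Random(0)
leaky = source_probe_accuracy(synthetic(200, 5.0, rng), synthetic(200, 5.0, rng))
clean = source_probe_accuracy(synthetic(200, 0.0, rng), synthetic(200, 0.0, rng))
print(f"probe accuracy with source shift: {leaky:.2f}")
print(f"probe accuracy without shift:     {clean:.2f}")
```

In practice the feature vectors would come from a pretrained encoder, and probe accuracy near chance level would be weak evidence that source identity is not the dominant signal.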

What would settle it

An experiment in which self-supervised models pretrained on FOMO260K show no better generalization to unseen clinical datasets than models pretrained on smaller homogeneous collections would indicate the central claim does not hold.
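
Such a comparison could be summarized with a paired test over the unseen evaluation datasets. The sketch below uses an exact one-sided sign-flip test on per-dataset score differences. The scores are placeholder numbers invented purely for illustration and are not results from the paper; the function name and the choice of test are likewise assumptions.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact one-sided sign-flip test for paired scores.

    Returns the fraction of sign assignments whose summed difference is at
    least as large as the observed sum of (a - b).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs)
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        # Small tolerance guards against floating-point ties.
        if sum(s * d for s, d in zip(signs, diffs)) >= observed - 1e-12:
            count += 1
    return count / total


# Placeholder scores on six hypothetical unseen datasets (NOT real results).
heterogeneous_pretrain = [0.81, 0.78, 0.85, 0.74, 0.80, 0.77]
homogeneous_pretrain = [0.76, 0.75, 0.80, 0.73, 0.74, 0.72]

p_value = paired_sign_flip_pvalue(heterogeneous_pretrain, homogeneous_pretrain)
print(f"one-sided p-value: {p_value:.4f}")
```

A large p-value on real held-out data would be the "no better generalization" outcome described above; with only a handful of evaluation datasets, the smallest attainable p-value is limited by the number of sign patterns.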

Figures

Figures reproduced from arXiv: 2506.14432 by Asbjørn Munk, Christian Hedeager Krag, Jakob Ambsdorf, Juan Eugenio Iglesias, Julia Machnio, Mads Nielsen, Michael Eriksen Benros, Mikael Boesen, Mostafa Mehdipour Ghazi, Pablo Rocamora García, Peirong Liu, Sebastian Nørgaard Llambias, Stefano Cerri, Vardan Nersesjan.

Figure 1: Representative examples from the FOMO60K dataset, illustrating the heterogeneity in image quality, MRI sequences, and the presence of brain anomalies. Minimal preprocessing was applied to retain the raw characteristics of the original images while improving usability. We also release code for self-supervised pretraining and fine-tuning to facilitate benchmarking, method development, and broader adoption of…
Abstract

We present FOMO260K, a large-scale, heterogeneous dataset of 260,927 brain Magnetic Resonance Imaging (MRI) scans from 77,589 MRI sessions and 55,378 subjects, aggregated from 910 publicly available sources. The dataset includes both clinical- and research-grade images, multiple MRI sequences, and a wide range of anatomical and pathological variability, including scans with large brain anomalies. Minimal preprocessing was applied to preserve the original image characteristics while reducing entry barriers for new users. Companion code for self-supervised pretraining and finetuning is provided, along with pretrained models. FOMO260K is intended to support the development and benchmarking of self-supervised learning methods in medical imaging at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents FOMO260K, a dataset of 260,927 brain MRI scans from 77,589 sessions and 55,378 subjects aggregated from 910 public sources. It includes clinical and research-grade images across multiple sequences and a range of anatomical/pathological variability, with only minimal preprocessing applied to preserve original characteristics. Companion code for self-supervised pretraining/finetuning and pretrained models are provided, with the explicit goal of supporting large-scale SSL development and benchmarking in medical imaging.

Significance. If the dataset's scale and heterogeneity can be shown to reflect clinically relevant variability rather than unmitigated acquisition artifacts, the release would be a substantial contribution to SSL research in medical imaging, where existing public datasets are typically orders of magnitude smaller. The provision of reproducible code and pretrained models is a clear strength that lowers barriers to entry and supports immediate use.

major comments (2)
  1. [Abstract] The central claim that FOMO260K supports 'development and benchmarking of self-supervised learning methods ... at scale' rests on the assumption that inter-source heterogeneity primarily captures anatomical and pathological variability. However, the abstract provides no quantitative source-stratified statistics (e.g., field strength distributions, sequence parameter ranges, or intensity histogram comparisons) to substantiate this, leaving open the possibility that models learn scanner-specific signatures instead.
  2. [Methods / Dataset construction] Minimal preprocessing is described as preserving original image characteristics, yet the manuscript reports no harmonization steps, artifact audits, or cross-source quality metrics. This directly impacts the weakest assumption that the aggregated data will yield generalizable SSL features; without such checks, domain shifts from acquisition protocols remain unmitigated and could dominate the training signal.
minor comments (1)
  1. [Results] The manuscript would benefit from an explicit table summarizing key acquisition metadata (field strength, scanner vendor, sequence type) broken down by source or at least by major cohorts to allow readers to assess heterogeneity.
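
The kind of source-stratified summary the referee asks for could be produced along the following lines. This is a minimal sketch under stated assumptions: the record layout, the field names (`source`, `field_strength_t`, `sequence`), and every value are hypothetical and do not describe the actual FOMO260K metadata.

```python
from collections import Counter, defaultdict

# Hypothetical per-scan metadata records; names and values are illustrative only.
records = [
    {"source": "source_a", "field_strength_t": 3.0, "sequence": "T1w"},
    {"source": "source_a", "field_strength_t": 3.0, "sequence": "T2w"},
    {"source": "source_a", "field_strength_t": 1.5, "sequence": "T1w"},
    {"source": "source_b", "field_strength_t": 1.5, "sequence": "FLAIR"},
    {"source": "source_b", "field_strength_t": 1.5, "sequence": "T1w"},
]


def stratify(records, key):
    """Count the values of `key` separately for every source."""
    table = defaultdict(Counter)
    for rec in records:
        table[rec["source"]][rec[key]] += 1
    return table


def format_table(table):
    """Render a source-by-value count table as aligned plain text."""
    columns = sorted({v for counts in table.values() for v in counts}, key=str)
    rows = [["source"] + [str(c) for c in columns]]
    for source in sorted(table):
        # Counter returns 0 for values a source never uses.
        rows.append([source] + [str(table[source][c]) for c in columns])
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)) for row in rows
    )


print(format_table(stratify(records, "field_strength_t")))
print()
print(format_table(stratify(records, "sequence")))
```

The same `stratify` helper works for any categorical metadata field, so scanner vendor or cohort could be tabulated the same way if that information is available.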

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on our manuscript describing the FOMO260K dataset. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.

  1. Referee: [Abstract] The central claim that FOMO260K supports 'development and benchmarking of self-supervised learning methods ... at scale' rests on the assumption that inter-source heterogeneity primarily captures anatomical and pathological variability. However, the abstract provides no quantitative source-stratified statistics (e.g., field strength distributions, sequence parameter ranges, or intensity histogram comparisons) to substantiate this, leaving open the possibility that models learn scanner-specific signatures instead.

    Authors: We agree that including quantitative source-stratified statistics in the abstract would better support the central claim. In the revised version, we will add a sentence summarizing key heterogeneity metrics, including the distribution of field strengths (e.g., 1.5T vs 3T) and sequence types across sources. Regarding the risk of models learning scanner-specific signatures, this is a known challenge in multi-source medical datasets; our intent is to provide data that reflects real clinical acquisition variability, which SSL methods must ultimately handle for practical deployment. We will expand the discussion section to note this limitation and suggest that users evaluate feature invariance. revision: yes

  2. Referee: [Methods / Dataset construction] Minimal preprocessing is described as preserving original image characteristics, yet the manuscript reports no harmonization steps, artifact audits, or cross-source quality metrics. This directly impacts the weakest assumption that the aggregated data will yield generalizable SSL features; without such checks, domain shifts from acquisition protocols remain unmitigated and could dominate the training signal.

    Authors: The choice of minimal preprocessing is deliberate to maintain the original characteristics of the images from diverse sources, enabling research into methods that are robust to real-world domain shifts. We did not apply harmonization because doing so would alter the dataset's intended heterogeneity. However, we acknowledge the referee's concern about quality metrics. In the revision, we will include additional details on the basic quality control performed during aggregation, such as checks for image dimensions and basic intensity statistics, and provide cross-source summaries where available from the source metadata. Comprehensive artifact audits across all 910 sources exceed the scope of this dataset release paper, as each source has its own curation process. revision: partial
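
As an illustration of what basic checks on image dimensions and intensity statistics might look like, here is a minimal sketch operating on a volume stored as a nested list. The function names, the `min_dim` threshold, and the specific checks are assumptions made for this example; the paper's actual quality-control procedure is not specified here.

```python
import math


def volume_shape(volume):
    """Return (x, y, z) for a regular nested-list volume, else None."""
    try:
        nx, ny, nz = len(volume), len(volume[0]), len(volume[0][0])
        regular = all(
            len(plane) == ny and all(len(row) == nz for row in plane)
            for plane in volume
        )
    except (TypeError, IndexError):
        return None
    return (nx, ny, nz) if regular else None


def qc_volume(volume, min_dim=8):
    """Return a list of QC failures; an empty list means the volume passed.

    `min_dim` is a hypothetical threshold chosen for this sketch.
    """
    shape = volume_shape(volume)
    if shape is None:
        return ["not a regular 3D volume"]
    issues = []
    if min(shape) < min_dim:
        issues.append(f"smallest dimension {min(shape)} < {min_dim}")
    voxels = [v for plane in volume for row in plane for v in row]
    if not voxels:
        return issues + ["empty volume"]
    if not all(isinstance(v, (int, float)) for v in voxels):
        return issues + ["non-numeric voxels"]
    if not all(math.isfinite(v) for v in voxels):
        return issues + ["non-finite intensities"]
    if max(voxels) == min(voxels):
        issues.append("constant intensity")
    return issues


def make_volume(n, fn):
    """Build an n x n x n nested list with voxel values fn(x, y, z)."""
    return [[[fn(x, y, z) for z in range(n)] for y in range(n)] for x in range(n)]


print(qc_volume(make_volume(8, lambda x, y, z: x + y + z)))
print(qc_volume(make_volume(8, lambda x, y, z: 0.0)))
print(qc_volume(make_volume(2, lambda x, y, z: x + y + z)))
print(qc_volume([[[1.0, 2.0], [3.0]]]))
```

Checks of this kind catch unusable files but do not detect motion artifacts or scanner-specific intensity shifts, which is the gap the referee's comment points at.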

standing simulated objections not resolved
  • Demonstrating that inter-source heterogeneity primarily reflects clinically relevant anatomical and pathological variability rather than unmitigated acquisition artifacts would require additional validation studies beyond the scope of this dataset release.

Circularity Check

0 steps flagged

No circularity: direct dataset aggregation with no derivations or predictions


The paper's contribution is the release of FOMO260K, formed by aggregating 260,927 scans from 910 public sources with only minimal preprocessing. No equations, fitted parameters, predictions, or derivation chains appear in the abstract or described content. The central claim reduces to a factual description of data collection and release rather than any self-referential or constructed result, rendering the work self-contained against external benchmarks with no load-bearing steps that reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is a dataset aggregation paper with no mathematical derivations. Free parameters are limited to any implicit choices in minimal preprocessing steps. No new entities are postulated. Axioms are standard assumptions about public data availability and MRI image properties.

axioms (1)
  • Domain assumption: Publicly available MRI datasets can be aggregated without violating original usage terms or introducing unresolvable privacy issues.
    Invoked by the decision to combine 910 sources into one release.

pith-pipeline@v0.9.0 · 5723 in / 1353 out tokens · 30319 ms · 2026-05-19T09:49:04.459807+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge

    cs.CV 2026-04 conditional novelty 6.0

    Self-supervised pretraining on 60K clinical-style brain MRIs improves out-of-domain generalization on classification, segmentation, and regression tasks, with hybrid objectives and small models showing strong results.

Acknowledgements

This work has been supported by the Danish Data Science Academy, which is funded by the Novo Nordisk Foundation (grant number NNF21SA0069429) and Villum Fonden (grant number 40...