Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG
Pith reviewed 2026-05-21 11:24 UTC · model grok-4.3
The pith
Brain-OF jointly processes fMRI, EEG and MEG in one model by mapping them to a shared semantic space and using dual-domain pretraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Brain-OF is an omnifunctional brain foundation model jointly pretrained on fMRI, EEG and MEG, capable of handling both unimodal and multimodal inputs within a unified framework. It introduces the Any-Resolution Neural Signal Sampler to project diverse brain signals into a shared semantic space, integrates DINT attention with a Sparse Mixture of Experts to capture invariant and specific representations, and uses Masked Temporal-Frequency Modeling as a dual-domain pretraining objective. Pretrained on around 40 datasets, it demonstrates superior performance across diverse downstream tasks.
What carries the argument
Any-Resolution Neural Signal Sampler that projects signals of varying resolutions into a shared semantic space, enabling unified processing across modalities.
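The review does not describe how the sampler works internally. The sketch below is a rough illustration only of how an "any-resolution" front end could operate, under stated assumptions. It assumes, hypothetically, that each recording is:

- cut into windows of a fixed duration;
- linearly resampled so that every window holds the same number of points;
- z-scored per window;
- passed through one projection shared by all modalities.

The function name any_resolution_tokens and its parameters (patch_seconds, patch_len, d_model, seed) are illustrative, not Brain-OF's API. The random projection stands in for learned weights.

```python
import numpy as np

def any_resolution_tokens(signal, fs, patch_seconds, patch_len=64, d_model=64, seed=0):
    """Hypothetical sketch: map a (channels, time) recording sampled at fs Hz
    to (n_channels * n_patches, d_model) tokens in one shared embedding space."""
    signal = np.asarray(signal, dtype=float)
    n_ch, n_t = signal.shape
    duration = n_t / fs                              # recording length in seconds
    n_patches = int(duration // patch_seconds)
    if n_patches == 0:
        raise ValueError("recording shorter than one patch")
    # Resample each channel so every patch_seconds window holds patch_len points,
    # whatever the native rate (e.g. EEG at 250 Hz, fMRI at one volume every 2 s).
    t_old = np.arange(n_t) / fs
    t_new = np.arange(n_patches * patch_len) * (patch_seconds / patch_len)
    resampled = np.stack([np.interp(t_new, t_old, ch) for ch in signal])
    patches = resampled.reshape(n_ch, n_patches, patch_len)
    # Per-patch z-scoring puts modalities with different units on a common scale.
    mu = patches.mean(axis=-1, keepdims=True)
    sd = patches.std(axis=-1, keepdims=True)
    patches = (patches - mu) / (sd + 1e-8)
    # One projection shared by all modalities (random here, learned in practice).
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((patch_len, d_model)) / np.sqrt(patch_len)
    return (patches @ w).reshape(-1, d_model)
```

For example, a 4 s, 19-channel EEG clip at 250 Hz with patch_seconds=1.0 yields 76 tokens. A 240 s, 100-ROI fMRI series at 0.5 Hz with patch_seconds=20.0 yields 1,200 tokens. Both share the same 64-dimensional space. This sketch leaves the spatial axis (electrodes versus ROIs) unaligned; how Brain-OF reconciles spatial resolution is not stated here.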
If this is right
- The model achieves superior performance on diverse downstream tasks compared to existing single-modality approaches.
- Joint multimodal integration allows exploitation of complementary spatiotemporal dynamics across neuroimaging techniques.
- Dual-domain pretraining helps internalize characteristics of neural activity through self-supervised reconstruction in time and frequency domains.
- Shared experts capture modality-invariant representations while routed experts handle modality-specific semantics.
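The shared/routed expert split can be illustrated with a minimal NumPy sketch of a sparse Mixture-of-Experts layer that also has always-on shared experts. The following are all assumptions, since the review gives no such details:

- expert width;
- softmax gating;
- top_k=2;
- renormalising gate weights over the chosen experts.

Load balancing is omitted; the reference list cites an auxiliary-loss-free load-balancing strategy, Wang et al., 2024b. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def make_expert(rng, d, hidden):
    """A small two-layer ReLU MLP with random (untrained) weights."""
    w1 = rng.standard_normal((d, hidden)) / np.sqrt(d)
    w2 = rng.standard_normal((hidden, d)) / np.sqrt(hidden)
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

def shared_routed_moe(x, shared, routed, w_gate, top_k=2):
    """x: (n_tokens, d). Every token goes through all shared experts (always on);
    a softmax gate then sends each token to its top_k routed experts."""
    out = np.zeros_like(x)
    for expert in shared:
        out += expert(x)
    probs = softmax(x @ w_gate)                      # (n_tokens, n_routed)
    top = np.argsort(-probs, axis=-1)[:, :top_k]     # chosen expert ids per token
    for i, ids in enumerate(top):
        gates = probs[i, ids] / probs[i, ids].sum()  # renormalise over chosen experts
        for g, e in zip(gates, ids):
            out[i] += g * routed[e](x[i:i + 1])[0]
    return out, top
```

The design intent this mirrors is that shared experts see every token and can learn modality-invariant structure. The gate lets each token reach only a few routed experts, which can specialise by modality.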
Where Pith is reading between the lines
- The approach could support transfer of learned representations between different neuroimaging modalities.
- It opens the possibility of developing more robust models for brain-computer interfaces that fuse multiple signal types.
- Similar resolution alignment techniques might apply to other heterogeneous sensor data in scientific domains.
Load-bearing premise
The Any-Resolution Neural Signal Sampler successfully projects diverse brain signals with severe semantic heterogeneity and resolution discrepancies into a shared semantic space while preserving task-relevant information.
What would settle it
If single-modality models pretrained only on the target modality (for example, an EEG-only model evaluated on EEG tasks, or an MEG-only model on MEG tasks) match or beat Brain-OF, this would challenge the value of joint multimodal pretraining.
Original abstract
Brain foundation models have achieved remarkable advances across a wide range of neuroscience tasks. However, most existing models are limited to a single functional modality, restricting their ability to exploit complementary spatiotemporal dynamics and the collective data scale across different neuroimaging techniques. This limitation largely arises from severe semantic heterogeneity and resolution discrepancies among modalities. To address these challenges, we propose Brain-OF, an omnifunctional brain foundation model jointly pretrained on fMRI, EEG and MEG, capable of handling both unimodal and multimodal inputs within a unified framework. To reconcile heterogeneous spatiotemporal resolutions, we introduce the Any-Resolution Neural Signal Sampler, which projects diverse brain signals into a shared semantic space. To further manage semantic shifts, the Brain-OF backbone integrates DINT attention with a Sparse Mixture of Experts, where shared experts capture modality-invariant representations and routed experts specialize in modality-specific semantics. Furthermore, to explicitly internalize the characteristics of neural activity through self-supervised learning, we propose Masked Temporal-Frequency Modeling, a dual-domain pretraining objective that jointly reconstructs brain signals in both the time and frequency domains. Brain-OF is pretrained on a large-scale corpus comprising around 40 datasets and demonstrates superior performance across diverse downstream tasks, highlighting the benefits of joint multimodal integration and dual-domain pretraining.
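The abstract describes Masked Temporal-Frequency Modeling only as jointly reconstructing brain signals in the time and frequency domains. Below is a minimal sketch of such a dual-domain objective. It assumes:

- random patch masking;
- a mean-squared error on masked patches in the time domain;
- a mean-squared error between FFT amplitude spectra (np.fft.rfft) as the frequency term, weighted by a hypothetical alpha.

This review does not state whether Brain-OF compares amplitudes, phases or complex spectra, or how it weights the two terms.

```python
import numpy as np

def random_patch_mask(n_patches, mask_ratio, rng):
    """Boolean mask with round(mask_ratio * n_patches) randomly chosen patches set."""
    n_mask = int(round(mask_ratio * n_patches))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, size=n_mask, replace=False)] = True
    return mask

def masked_temporal_frequency_loss(target, recon, mask, alpha=1.0):
    """target, recon: (n_patches, patch_len). Only masked patches contribute:
    time-domain MSE plus alpha * MSE between FFT amplitude spectra."""
    t, r = target[mask], recon[mask]
    time_loss = np.mean((t - r) ** 2)
    amp_t = np.abs(np.fft.rfft(t, axis=-1))
    amp_r = np.abs(np.fft.rfft(r, axis=-1))
    freq_loss = np.mean((amp_t - amp_r) ** 2)
    return time_loss + alpha * freq_loss, time_loss, freq_loss
```

The tests use mask_ratio=0.7. That is the value the appendix excerpt in the reference graph reports as strongest overall on TUAB and ADNI.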
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Brain-OF, an omnifunctional foundation model jointly pretrained on fMRI, EEG, and MEG data from approximately 40 datasets. It introduces the Any-Resolution Neural Signal Sampler to project signals with heterogeneous resolutions and semantics into a shared space, integrates DINT attention with a Sparse Mixture of Experts backbone (shared experts for invariant features, routed experts for modality-specific semantics), and uses Masked Temporal-Frequency Modeling as a dual-domain self-supervised objective. The central claim is that this multimodal joint pretraining yields superior performance on diverse downstream tasks compared to single-modality approaches.
Significance. If the empirical results and ablations hold, the work would be significant for the field of brain foundation models. It directly addresses the limitation of modality-specific models by demonstrating scalable joint pretraining across fMRI, EEG, and MEG, which could enable better exploitation of complementary spatiotemporal information. The dual-domain pretraining and MoE design are potentially generalizable contributions if supported by rigorous controls.
major comments (1)
- §3.2 (Any-Resolution Neural Signal Sampler): The manuscript does not report isolated validation experiments, such as per-modality reconstruction fidelity, mutual information between projected representations and task labels, or controlled ablations that disable the sampler while keeping the rest of the pipeline fixed. This component is load-bearing for the central claim. Without evidence that it maps signals with severe semantic heterogeneity and resolution discrepancies into a shared space without substantial information loss, the reported benefits of joint pretraining and the DINT+MoE backbone cannot be confidently attributed to multimodal integration rather than superficial invariants.
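The major comment asks for mutual information between projected representations and task labels. One simple, and biased, way to estimate it is a plug-in estimate from a histogram of one embedding coordinate (or a 1-D projection of the embedding) against discrete labels. The sketch below assumes quantile bins with an arbitrary n_bins=10. For full high-dimensional embeddings, a k-NN or classifier-based estimator could be more appropriate. The function name is hypothetical.

```python
import numpy as np

def binned_mutual_information(feature, labels, n_bins=10):
    """Plug-in estimate of I(feature; label) in nats from a 2-D histogram.
    feature: 1-D array (one embedding dimension or a projection); labels: discrete."""
    feature = np.asarray(feature, dtype=float)
    edges = np.quantile(feature, np.linspace(0.0, 1.0, n_bins + 1))
    f_bins = np.clip(np.searchsorted(edges, feature, side="right") - 1, 0, n_bins - 1)
    classes, y = np.unique(labels, return_inverse=True)
    joint = np.zeros((n_bins, len(classes)))
    np.add.at(joint, (f_bins, y), 1)                 # count co-occurrences
    joint /= joint.sum()
    pf = joint.sum(axis=1, keepdims=True)            # marginal over feature bins
    py = joint.sum(axis=0, keepdims=True)            # marginal over labels
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pf @ py)[nz])))
```

A feature that tracks the label should score near log(2) for two balanced classes. Pure noise should score near zero, up to a small positive bias of roughly (bins - 1)(classes - 1) / (2N).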
minor comments (2)
- [Abstract] The abstract asserts superior performance across downstream tasks but does not include any quantitative metrics, baselines, error bars, or dataset statistics; while the full manuscript presumably contains these in the experiments section, their absence from the summary reduces immediate evaluability.
- [§4] Notation for the DINT attention mechanism and the routing in the Sparse Mixture of Experts could be clarified with an explicit equation or diagram in the methods section to aid reproducibility.
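Since the minor comment asks for explicit notation, the NumPy sketch below shows the differential attention of the Differential Transformer (Ye et al., reference [14]). Its authors describe the DINT Transformer (Cang et al., reference [1]) as extending that mechanism with an additional integral component. Two softmax attention maps are computed from separate query/key projections and subtracted: (softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) V.

The sketch omits:

- DINT's integral term;
- multi-head layout and normalisation;
- the learnable parameterisation of lam.

The fixed lam=0.5 is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(x, wq1, wk1, wq2, wk2, wv, lam=0.5):
    """Single-head differential attention on tokens x: (n, d_model).
    Subtracting a second attention map is meant to cancel common-mode noise."""
    d = wq1.shape[1]
    a1 = softmax((x @ wq1) @ (x @ wk1).T / np.sqrt(d))
    a2 = softmax((x @ wq2) @ (x @ wk2).T / np.sqrt(d))
    attn = a1 - lam * a2                   # each row sums to 1 - lam
    return attn @ (x @ wv), attn
```

With lam=0 this reduces to ordinary scaled dot-product attention using the first query/key pair.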
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the major comment point by point below and have incorporated revisions to strengthen the manuscript.
Referee: §3.2 (Any-Resolution Neural Signal Sampler): The manuscript does not report isolated validation experiments, such as per-modality reconstruction fidelity, mutual information between projected representations and task labels, or controlled ablations that disable the sampler while keeping the rest of the pipeline fixed. This component is load-bearing for the central claim. Without evidence that it maps signals with severe semantic heterogeneity and resolution discrepancies into a shared space without substantial information loss, the reported benefits of joint pretraining and the DINT+MoE backbone cannot be confidently attributed to multimodal integration rather than superficial invariants.
Authors: We agree that providing isolated validation for the Any-Resolution Neural Signal Sampler is necessary to rigorously support attribution of gains to multimodal integration. In the revised manuscript we add per-modality reconstruction fidelity metrics (MSE and cosine similarity) for fMRI, EEG, and MEG after projection into the shared space. We further report mutual information between the projected embeddings and downstream task labels across representative datasets. Finally, we include a controlled ablation that replaces the sampler with naive zero-padding and linear interpolation while freezing all other components (DINT attention, Sparse MoE, and Masked Temporal-Frequency Modeling). The ablation shows a clear performance drop on cross-modal and multimodal downstream tasks, confirming that the sampler preserves task-relevant information and enables effective joint pretraining rather than relying on superficial invariants. These results are reported in an expanded §3.2 and new supplementary tables. revision: yes
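The rebuttal's ablation replaces the sampler with "naive zero-padding and linear interpolation" and reports MSE and cosine similarity. The sketch below shows one way such a baseline and those two metrics could look. It assumes channels are zero-padded (or truncated) to a fixed count and each channel is linearly interpolated onto a fixed number of samples over the same time span. The names naive_align and fidelity are hypothetical, and the authors' exact padding scheme is not given.

```python
import numpy as np

def naive_align(signal, n_channels, n_samples):
    """Hypothetical ablation baseline: zero-pad (or truncate) channels to n_channels
    and linearly interpolate each channel onto n_samples evenly spaced points."""
    signal = np.asarray(signal, dtype=float)
    n_ch, n_t = signal.shape
    grid_old = np.linspace(0.0, 1.0, n_t)
    grid_new = np.linspace(0.0, 1.0, n_samples)
    resampled = np.stack([np.interp(grid_new, grid_old, ch) for ch in signal])
    out = np.zeros((n_channels, n_samples))
    k = min(n_ch, n_channels)
    out[:k] = resampled[:k]                  # extra channels stay zero
    return out

def fidelity(x, x_hat):
    """MSE and cosine similarity between a signal and its reconstruction (flattened)."""
    x, x_hat = np.ravel(x), np.ravel(x_hat)
    mse = float(np.mean((x - x_hat) ** 2))
    cos = float(x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat) + 1e-12))
    return mse, cos
```

Down-sampling a broadband signal and interpolating it back loses high-frequency content. The MSE therefore rises well above zero, which is the kind of information loss the referee asked the authors to rule out for the learned sampler.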
Circularity Check
No significant circularity; derivation chain is self-contained with independent empirical claims
The paper motivates its components (Any-Resolution Neural Signal Sampler, DINT+MoE backbone, Masked Temporal-Frequency Modeling) as architectural responses to modality heterogeneity and then reports empirical gains from pretraining on ~40 datasets. No equations, fitted parameters renamed as predictions, or self-citations are shown that reduce the central performance claim to a tautology or input by construction. The sampler projects signals into a shared space but is not defined in terms of the downstream task outcomes it is claimed to enable; results remain externally falsifiable via replication. This is the normal case of a non-circular empirical foundation model paper.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Any-Resolution Neural Signal Sampler... projects diverse brain signals into a shared semantic space... Masked Temporal-Frequency Modeling, a dual-domain pretraining objective that jointly reconstructs brain signals in both the time and frequency domains."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat (unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Sparse Mixture of Experts... shared experts capture modality-invariant representations and routed experts specialize in modality-specific semantics"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Cang, Y., Liu, Y., Zhang, X., Zhao, E., and Shi, L. Dint transformer. arXiv preprint arXiv:2501.17486.
- [2] Caro, J. O., Fonseca, A. H. d. O., Averill, C., Rizvi, S. A., Rosati, M., Cross, J. L., Mittal, P., Zappala, E., Levine, D., Dhodapkar, R. M., et al. Brainlm: A foundation model for brain activity recordings. bioRxiv, pp. 2023–09.
- [3] Chowdhury, N. S., Bi, C., Furman, A. J., Chiang, A. K., Skippen, P., Si, E., Millard, S. K., Margerison, S. M., Spies, D., Keaser, M. L., Silva, J. T. D., Chen, S., Schabrun, S. M., and Seminowicz, D. A. "predict". 2025a. doi: 10.18112/openneuro.ds005486.v1.0.1. Chowdhury, N. S., Bi, C., Furman, A. J., Chiang, A. K., Skippen, P., Si, E., Millard, S. K...
- [4]
- [5] Jiang, W.-B., Zhao, L.-M., and Lu, B.-L. Large brain model for learning generic representations with tremendous eeg data in bci. arXiv preprint arXiv:2405.18765.
- [6] Jiang, W.-B., Fu, X., Ding, Y., and Guan, C. Towards robust multimodal physiological foundation models: Handling arbitrary missing modalities. arXiv preprint arXiv:2504.19596, 2025a. Jiang, W.-B., Liu, X.-H., Zheng, W.-L., and Lu, B.-L. Seed-vii: A multimodal dataset of six basic emotions with continuous labels for emotion recognition. IEEE Transact...
- [7] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- [8] Liu, W., Qiu, J.-L., Zheng, W.-L., and Lu, B.-L. Comparing recognition performance and robustness of multimodal deep learning models for multimodal emotion recognition. IEEE Transactions on Cognitive and Developmental Systems.
- [9] Maaten, L. v. d. and Hinton, G. Visualizing data using t-sne. Journal of machine learning research, 9(Nov): 2579–2605.
- [10] Song, Y., Jia, X., Yang, L., and Xie, L. Transformer-based spatial-temporal feature learning for eeg decoding. arXiv preprint arXiv:2106.11170.
- [11] Wang, C., Subramaniam, V., Yaari, A. U., Kreiman, G., Katz, B., Cases, I., and Barbu, A. Brainbert: Self-supervised representation learning for intracranial recordings. arXiv preprint arXiv:2302.14367.
- [12] Wang, J., Zhao, S., Luo, Z., Zhou, Y., Jiang, H., Li, S., Li, T., and Pan, G. Cbramod: A criss-cross brain foundation model for eeg decoding. arXiv preprint arXiv:2412.07236, 2024a. Wang, L., Gao, H., Zhao, C., Sun, X., and Dai, D. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint arXiv:2408.15664, 2024b. Wei, D., Zhuang,...
- [13] Xiao, Q., Cui, Z., Zhang, C., Chen, S., Wu, W., Thwaites, A., Woolgar, A., Zhou, B., and Zhang, C. Brainomni: A brain foundation model for unified eeg and meg signals. arXiv preprint arXiv:2505.18185.
- [14] Ye, T., Dong, L., Xia, Y., Sun, Y., Zhu, Y., Huang, G., and Wei, F. Differential transformer. arXiv preprint arXiv:2410.05258.
- [15] Zheng, W.-L. and Lu, B.-L. A multimodal approach to estimating vigilance using eeg and forehead eog. Journal of Neural Engineering, 14(2): 026017. URL http://stacks.iop.org/1741-2552/14/i=2/a=026017.
- [16] Zhou, Y., Wu, J., Ren, Z., Yao, Z., Lu, W., Peng, K., Zheng, Q., Song, C., Ouyang, W., and Gou, C. Csbrain: A cross-scale spatiotemporal brain foundation model for eeg decoding. arXiv preprint arXiv:2506.23075.
- [17] A.1. Data Preprocessing. If the public datasets provide official preprocessed data, we use them directly. Otherwise, we apply minimal, modality-specific preprocessing pipelines to remove artifacts. Raw EEG and MEG signals are preprocessed using MNE-Python (Gramfort et al., 2014), and fMRI data are preprocessed with fMRIPrep (Esteban et al., 2019). After p...
- [18] ... is a large-scale longitudinal resting-state EEG dataset collected from 608 participants (ages 20–70), with 208 participants returning for a follow-up session approximately 5 years later. High-density rs-EEG was acquired using a 64-channel elastic cap (10–20 system) at a 1,000 Hz sampling rate. Recordings were obtained both before and after a 2-hour battery...
- [19] ... and SEED-SD (Li et al., 2025). All datasets were recorded using a 62-channel ESI NeuroScan system at a 1,000 Hz sampling rate during emotion-related paradigms spanning positive, negative and neutral valence, as well as more fine-grained affective states (e.g., amusement, fear, sadness). In total, the SEED Series provides EEG recordings from 136 unique parti...
- [20] ... is a high-quality motor imagery EEG dataset comprising recordings from 62 healthy participants (ages 17–30) across three sessions. For pretraining, we incorporate the officially released preprocessed signals: 4-second trials recorded with 58 EEG channels. ABIDE is an international collaborative initiative that aggregates resting-state fMRI, structural MR...
- [21] ... is a large-scale, high-quality pain and placebo analgesia fMRI collection comprising 395 healthy adults (age 30–43 years). Each participant completed four fMRI runs during a well-controlled thermal pain and placebo manipulation paradigm, plus anatomical scans and field maps. QTIM (Blokland et al., 2011; Sinclair et al.,
- [22] ... is an emotion recognition EEG benchmark collected from 20 participants using a 62-channel ESI NeuroScan system at a 1,000 Hz sampling rate. It includes five emotion categories (happy, sad, fear, disgust and neutral) recorded across three sessions. Each session contains 15 trials, yielding a total of 117,744 1-second samples. Following the conventional str...
- [23] Performance drops sharply at a low mask ratio of 0.4 across all modalities and metrics, so we do not explore ratios below this threshold. For TUAB and ADNI, a ratio of 0.7 yields the strongest overall results (e.g., 81.88% BAC on TUAB, 68.23% BAC on ADNI), though 0.8 performs comparably on ADNI (74.53% AUROC). In contrast, ratios of 0.6 and 0.5 are superi...