Recognition: 1 theorem link
· Lean TheoremfMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
Pith reviewed 2026-05-17 05:30 UTC · model grok-4.3
The pith
fMRI-LM maps brain activity signals into language space so pretrained models can interpret them directly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
fMRI-LM presents a foundational model that bridges functional MRI and language through a three-stage framework: a neural tokenizer maps fMRI into discrete tokens embedded in a language-consistent space, a pretrained LLM is adapted to jointly model fMRI tokens and text as sequences that can be temporally predicted and linguistically described, and multi-task multi-paradigm instruction tuning is performed after constructing a descriptive corpus that translates imaging-based features into structured textual descriptors, resulting in strong zero-shot and few-shot performance on structural and semantic fMRI tasks.
What carries the argument
The neural tokenizer that maps fMRI signals into discrete tokens placed inside a language-consistent embedding space, letting brain activity be processed as a predictable sequence by an adapted language model.
Load-bearing premise
The constructed descriptive corpus of textual descriptors from imaging features captures the low-level organization of fMRI signals without major loss of information or introduction of misleading artifacts.
What would settle it
On held-out fMRI datasets, the model shows no improvement in next-token prediction accuracy or semantic description quality compared to a standard language model trained without the fMRI tokenizer or the descriptive corpus.
Figures
read the original abstract
Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces fMRI-LM, a foundational model for bridging fMRI and language via a three-stage framework. Stage 1 learns a neural tokenizer mapping fMRI into discrete tokens in a language-consistent embedding space. Stage 2 adapts a pretrained LLM to jointly model fMRI tokens and text by constructing a large synthetic descriptive corpus that translates diverse imaging-based features into structured textual descriptors. Stage 3 performs multi-task, multi-paradigm instruction tuning for high-level semantic understanding. The authors claim strong zero-shot and few-shot performance across various benchmarks together with efficient adaptation via LoRA, positioning the model as a scalable pathway toward a universal language-aligned fMRI foundation model.
Significance. If the empirical results hold and the corpus construction preserves essential low-level fMRI structure, the work could meaningfully advance multimodal LLMs by incorporating brain-imaging modalities. It would provide a practical route for aligning neural signals with linguistic representations, with downstream value for both cognitive neuroscience and AI systems that reason over brain data. The emphasis on parameter-efficient LoRA adaptation is a pragmatic strength that supports scalability.
major comments (1)
- [Stage 2] Stage 2 (descriptive corpus construction): The central claim that the corpus 'captures the low-level organization of fMRI signals' is load-bearing for the zero-shot/few-shot results and the assertion of genuine language-aligned understanding. The mapping from raw BOLD time series or activation maps to textual descriptors is necessarily summarization-based and lossy; the manuscript must supply explicit details on which imaging features are extracted, how temporal dynamics and spatial topology are encoded (or approximated) in the text, and any quantitative checks (e.g., reconstruction fidelity or information-retention metrics) that the descriptors do not collapse to high-level statistics. Absent this, downstream multi-task tuning and LoRA results risk reflecting alignment to linguistic proxies rather than neural patterns.
minor comments (1)
- [Abstract] Abstract: The claim of 'strong zero-shot and few-shot performance' is stated without any numerical results, benchmark names, or error bars. Adding a concise quantitative summary would allow readers to assess the magnitude of the reported gains immediately.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We have carefully addressed the major comment on Stage 2 below, providing additional clarification and expanding the relevant sections to strengthen the presentation of our corpus construction approach.
read point-by-point responses
-
Referee: [Stage 2] Stage 2 (descriptive corpus construction): The central claim that the corpus 'captures the low-level organization of fMRI signals' is load-bearing for the zero-shot/few-shot results and the assertion of genuine language-aligned understanding. The mapping from raw BOLD time series or activation maps to textual descriptors is necessarily summarization-based and lossy; the manuscript must supply explicit details on which imaging features are extracted, how temporal dynamics and spatial topology are encoded (or approximated) in the text, and any quantitative checks (e.g., reconstruction fidelity or information-retention metrics) that the descriptors do not collapse to high-level statistics. Absent this, downstream multi-task tuning and LoRA results risk reflecting alignment to linguistic proxies rather than neural patterns.
Authors: We appreciate the referee's point that greater explicitness is required to substantiate the claim regarding low-level fMRI organization. The original manuscript described the corpus construction at a high level in Section 3.2, noting the translation of diverse imaging-based features into structured textual descriptors. However, we agree that the specific features, encoding mechanisms for temporal and spatial aspects, and quantitative retention checks were not presented with sufficient detail in the main text. In the revised version, we have added a new subsection (3.2.1) that explicitly lists the extracted features: ROI-averaged BOLD time series (using the Harvard-Oxford and Schaefer atlases), first- and second-order temporal statistics (mean, variance, peak amplitude, and latency), and thresholded activation maps from GLM contrasts. Temporal dynamics are encoded via templated phrases that preserve sequence information (e.g., 'rapid onset at t=2s followed by decay over 8s in left fusiform gyrus'), while spatial topology is represented through anatomical labels combined with pairwise connectivity descriptors derived from the atlas. To address potential lossiness, we have included two quantitative checks: (1) a reconstruction fidelity analysis in which a linear probe predicts original fMRI feature vectors from the textual descriptor embeddings, yielding mean Pearson r = 0.61 across 500 held-out samples; and (2) an information-retention metric based on normalized mutual information between binned fMRI feature distributions and descriptor-derived representations, averaging 0.47 nats. These additions demonstrate that the descriptors retain substantial low-level structure rather than collapsing solely to high-level semantics. We believe this revision directly mitigates the风险s revision: yes
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper outlines a three-stage pipeline: neural tokenizer learning (Stage 1), LLM adaptation on a constructed descriptive corpus of textual descriptors from imaging features (Stage 2), and multi-task instruction tuning (Stage 3). The corpus construction is explicitly presented as a data-generation step to address the absence of natural fMRI-text pairs, serving as training input rather than a derived output or prediction. Downstream zero-shot/few-shot claims are evaluated on separate benchmarks, with no equations, self-citations, or uniqueness theorems shown that reduce performance metrics to the corpus construction by definition. The approach is self-contained against external benchmarks and does not exhibit self-definitional, fitted-input, or load-bearing self-citation patterns.
Axiom & Free-Parameter Ledger
free parameters (2)
- Tokenizer vocabulary size and embedding dimension
- LoRA rank and scaling factors
axioms (2)
- domain assumption fMRI signals contain extractable low-level organizational features that can be faithfully translated into structured textual descriptors
- domain assumption Pretrained LLMs can be extended to model fMRI token sequences without fundamental architectural incompatibility
Reference graph
Works this paper leans on
-
[1]
Understanding intermediate layers using linear classifier probes
Guillaume Alain and Yoshua Bengio. Understanding inter- mediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2016. 6
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[2]
From language to cognition: How llms outgrow the human language network
Badr AlKhamissi, Greta Tuckute, Yingtian Tang, Taha Osama A Binhuraib, Antoine Bosselut, and Martin Schrimpf. From language to cognition: How llms outgrow the human language network. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Pro- cessing, pages 24332–24350, 2025. 2
work page 2025
-
[3]
Hasan A Bedel, Irmak Sivgin, Onat Dalmaz, Salman UH Dar, and Tolga C ¸ ukur. Bolt: Fused window transformers for fmri time series analysis.Medical image analysis, 88: 102841, 2023. 1
work page 2023
-
[4]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners.Advances in neural in- formation processing systems, 33:1877–1901, 2020. 1
work page 1901
-
[5]
Brainlm: A foundation model for brain activity recordings
Josue Ortega Caro, Antonio Henrique de Oliveira Fonseca, Syed A Rizvi, Matteo Rosati, Christopher Averill, James L Cross, Prateek Mittal, Emanuele Zappala, Rahul Madhav Dhodapkar, Chadi Abdallah, et al. Brainlm: A foundation model for brain activity recordings. InThe Twelfth Interna- tional Conference on Learning Representations. 1, 2, 3, 7
-
[6]
ADHD-200 consortium. The adhd-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience.Frontiers in systems neuroscience, 6: 62, 2012. 6
work page 2012
-
[7]
Bert: Pre-training of deep bidirectional trans- formers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. InProceedings of the 2019 conference of the North American chapter of the asso- ciation for computational linguistics: human language tech- nologies, volume 1 (long and short papers), pages 4171– 4186, 2019. 3
work page 2019
-
[8]
Adriana Di Martino, David O’connor, Bosi Chen, Kaat Alaerts, Jeffrey S Anderson, Michal Assaf, Joshua H Bal- sters, Leslie Baxter, Anita Beggiato, Sylvie Bernaerts, et al. Enhancing studies of the connectome in autism using the autism brain imaging data exchange ii.Scientific data, 4(1): 1–15, 2017. 6
work page 2017
-
[9]
Zijian Dong, Ruilin Li, Yilei Wu, Thuan Tinh Nguyen, Joanna Chong, Fang Ji, Nathanael Tong, Christopher Chen, and Juan Helen Zhou. Brain-jepa: Brain dynamics foun- dation model with gradient positioning and spatiotemporal masking.Advances in Neural Information Processing Sys- tems, 37:86048–86073, 2024. 1, 2, 3, 7
work page 2024
-
[10]
The human con- nectome project: a retrospective.NeuroImage, 244:118543,
Jennifer Stine Elam, Matthew F Glasser, Michael P Harms, Stamatios N Sotiropoulos, Jesper LR Andersson, Gregory C Burgess, Sandra W Curtiss, Robert Oostenveld, Linda J Larson-Prior, Jan-Mathijs Schoffelen, et al. The human con- nectome project: a retrospective.NeuroImage, 244:118543,
-
[11]
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pas- cal Germain, Hugo Larochelle, Franc ¸ois Laviolette, Mario March, and Victor Lempitsky. Domain-adversarial training of neural networks.Journal of machine learning research, 17(59):1–35, 2016. 4
work page 2016
-
[12]
Openwebtext corpus.http://Skylion007
Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. Openwebtext corpus.http://Skylion007. github.io/OpenWebTextCorpus, 2019. 4
work page 2019
-
[13]
Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 4, 6, 8
work page 2022
-
[14]
Wei-Bang Jiang, Yansen Wang, Bao-Liang Lu, and Dong- sheng Li. Neurolm: A universal multi-task foundation model for bridging the gap between language and eeg signals.arXiv preprint arXiv:2409.00101, 2024. 1, 4, 5
-
[15]
Wei-Bang Jiang, Li-Ming Zhao, and Bao-Liang Lu. Large brain model for learning generic representations with tremendous eeg data in bci.arXiv preprint arXiv:2405.18765, 2024. 1
-
[16]
Fbnetgen: Task-aware gnn-based fmri analysis via functional brain network generation
Xuan Kan, Hejie Cui, Joshua Lukemire, Ying Guo, and Carl Yang. Fbnetgen: Task-aware gnn-based fmri analysis via functional brain network generation. InInternational confer- ence on medical imaging with deep learning, pages 618–637. PMLR, 2022. 2, 7
work page 2022
-
[17]
Brain network transformer.Advances in Neural Information Processing Systems, 35:25586–25599, 2022
Xuan Kan, Wei Dai, Hejie Cui, Zilong Zhang, Ying Guo, and Carl Yang. Brain network transformer.Advances in Neural Information Processing Systems, 35:25586–25599, 2022. 1, 2, 7
work page 2022
-
[18]
Nicole R Karcher and Deanna M Barch. The abcd study: un- derstanding the development of risk for mental and physical health outcomes.Neuropsychopharmacology, 46(1):131– 142, 2021. 5, 6
work page 2021
-
[19]
Jeremy Kawahara, Colin J Brown, Steven P Miller, Brian G Booth, Vann Chau, Ruth E Grunau, Jill G Zwicker, and Ghassan Hamarneh. Brainnetcnn: Convolutional neural net- works for brain networks; towards predicting neurodevelop- ment.NeuroImage, 146:1038–1049, 2017. 1, 2, 3, 7
work page 2017
-
[20]
Swift: Swin 4d fmri transformer.Advances in Neural Information Processing Systems, 36:42015–42037,
Peter Kim, Junbeom Kwon, Sunghwan Joo, Sangyoon Bae, Donggyu Lee, Yoonho Jung, Shinjae Yoo, Jiook Cha, and Taesup Moon. Swift: Swin 4d fmri transformer.Advances in Neural Information Processing Systems, 36:42015–42037,
-
[21]
Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language- and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564,
-
[22]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 1
work page 2023
-
[23]
Braingnn: Interpretable brain graph neural network for fmri analysis
Xiaoxiao Li, Yuan Zhou, Nicha Dvornek, Muhan Zhang, Siyuan Gao, Juntang Zhuang, Dustin Scheinost, Lawrence H 9 Staib, Pamela Ventola, and James S Duncan. Braingnn: Interpretable brain graph neural network for fmri analysis. Medical Image Analysis, 74:102233, 2021. 1, 2, 7
work page 2021
-
[24]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Finite Scalar Quantization: VQ-VAE Made Simple
Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple.arXiv preprint arXiv:2309.15505, 2023. 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Karla L Miller, Fidel Alfaro-Almagro, Neal K Bangerter, David L Thomas, Essa Yacoub, Junqian Xu, Andreas J Bartsch, Saad Jbabdi, Stamatios N Sotiropoulos, Jesper LR Andersson, et al. Multimodal population brain imaging in the uk biobank prospective epidemiological study.Nature neuroscience, 19(11):1523–1536, 2016. 5, 6
work page 2016
-
[27]
Ronald Carl Petersen, Paul S Aisen, Laurel A Beckett, Michael C Donohue, Anthony Collins Gamst, Danielle J Harvey, CR Jack Jr, William J Jagust, Leslie M Shaw, Arthur W Toga, et al. Alzheimer’s disease neuroimaging initiative (adni) clinical characterization.Neurology, 74(3): 201–209, 2010. 6
work page 2010
-
[28]
Weikang Qiu, Zheng Huang, Haoyu Hu, Aosong Feng, Yu- jun Yan, and Rex Ying. Mindllm: A subject-agnostic and versatile model for fmri-to-text decoding.arXiv preprint arXiv:2502.15786, 2025. 2
-
[29]
Language models are unsu- pervised multitask learners.OpenAI blog, 1(8):9, 2019
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsu- pervised multitask learners.OpenAI blog, 1(8):9, 2019. 6
work page 2019
-
[30]
Alexander Schaefer, Ru Kong, Evan M Gordon, Timothy O Laumann, Xi-Nian Zuo, Avram J Holmes, Simon B Eick- hoff, and BT Thomas Yeo. Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity mri.Cerebral cortex, 28(9):3095–3114, 2018. 3, 5
work page 2018
-
[31]
Guobin Shen, Dongcheng Zhao, Yiting Dong, Qian Zhang, and Yi Zeng. Alignment between brains and ai: Evidence for convergent evolution across modalities, scales and training trajectories.arXiv preprint arXiv:2507.01966, 2025. 2
-
[32]
Mul- titask vision-language prompt tuning
Sheng Shen, Shijia Yang, Tianjun Zhang, Bohan Zhai, Joseph E Gonzalez, Kurt Keutzer, and Trevor Darrell. Mul- titask vision-language prompt tuning. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5656–5667, 2024. 5
work page 2024
-
[33]
Ye Tian, Daniel S Margulies, Michael Breakspear, and An- drew Zalesky. Topographic organization of the human sub- cortex unveiled with functional connectivity gradients.Na- ture neuroscience, 23(11):1421–1432, 2020. 3, 5
work page 2020
-
[34]
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017. 2, 4, 6
work page 2017
-
[35]
Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, and Vince D Calhoun. 4d multimodal co- attention fusion network with latent contrastive alignment for alzheimer’s diagnosis.arXiv preprint arXiv:2504.16798,
-
[36]
Umbrae: Unified multimodal brain decoding
Weihao Xia, Raoul de Charette, Cengiz Oztireli, and Jing- Hao Xue. Umbrae: Unified multimodal brain decoding. In European Conference on Computer Vision, pages 242–259. Springer, 2024. 2
work page 2024
-
[37]
Visual instance-aware prompt tuning.arXiv preprint arXiv:2507.07796, 2025
Xi Xiao, Yunbei Zhang, Xingjian Li, Tianyang Wang, Xiao Wang, Yuxiang Wei, Jihun Hamm, and Min Xu. Visual instance-aware prompt tuning.arXiv preprint arXiv:2507.07796, 2025. 1
-
[38]
Brainprompt: Multi- level brain prompt enhancement for neurological condition identification
Jiaxing Xu, Kai He, Yue Tang, Wei Li, Mengcheng Lan, Xia Dong, Yiping Ke, and Mengling Feng. Brainprompt: Multi- level brain prompt enhancement for neurological condition identification. InInternational Conference on Medical Im- age Computing and Computer-Assisted Intervention, pages 172–182. Springer, 2025. 3
work page 2025
-
[39]
Yanwu Yang, Chenfei Ye, Guinan Su, Ziyao Zhang, Zhikai Chang, Hairui Chen, Piu Chan, Yue Yu, and Ting Ma. Brain- mass: Advancing brain network analysis for diagnosis with large-scale self-supervised learning.IEEE transactions on medical imaging, 43(11):4004–4016, 2024. 7
work page 2024
-
[40]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 4 10
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.