Recognition: no theorem link
Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?
Pith reviewed 2026-05-13 18:07 UTC · model grok-4.3
The pith
A 2D natural-image autoencoder can compress 3D fMRI volumes into compact tokens that let a Transformer capture long-range brain dynamics with far less memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TABLeT compresses each 3D fMRI volume with a frozen 2D natural-image autoencoder into a compact set of continuous tokens; a Transformer encoder then processes long token sequences for brain-dynamics tasks, outperforming voxel-based models in accuracy while using far less memory on UKB, HCP, and ADHD-200 benchmarks.
What carries the argument
Tokenization of 3D fMRI volumes by a pre-trained 2D natural-image autoencoder that produces a compact sequence of continuous tokens for input to a Transformer encoder.
Load-bearing premise
That a 2D autoencoder trained only on everyday photos can compress 3D fMRI volumes without discarding the spatiotemporal details required for accurate long-range dynamics modeling.
What would settle it
Train the same Transformer on identical long fMRI sequences once with raw voxels and once with the autoencoded tokens, then check whether task accuracy drops sharply with the tokens.
Figures
read the original abstract
Modeling long-range spatiotemporal dynamics in functional Magnetic Resonance Imaging (fMRI) remains a key challenge due to the high dimensionality of the four-dimensional signals. Prior voxel-based models, although demonstrating excellent performance and interpretation capabilities, are constrained by prohibitive memory demands and thus can only capture limited temporal windows. To address this, we propose TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer), a novel approach that tokenizes fMRI volumes using a pre-trained 2D natural image autoencoder. Each 3D fMRI volume is compressed into a compact set of continuous tokens, enabling long-sequence modeling with a simple Transformer encoder with limited VRAM. Across large-scale benchmarks including the UK-Biobank (UKB), Human Connectome Project (HCP), and ADHD-200 datasets, TABLeT outperforms existing models in multiple tasks, while demonstrating substantial gains in computational and memory efficiency over the state-of-the-art voxel-based method given the same input. Furthermore, we develop a self-supervised masked token modeling approach to pre-train TABLeT, which improves the model's performance for various downstream tasks. Our findings suggest a promising approach for scalable and interpretable spatiotemporal modeling of brain activity. Our code is available at https://github.com/beotborry/TABLeT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TABLeT, which tokenizes 3D fMRI volumes via a pre-trained 2D natural-image autoencoder into compact continuous tokens, enabling long-range spatiotemporal modeling with a standard Transformer encoder under limited memory. It reports outperformance over prior voxel-based models on UK-Biobank, HCP, and ADHD-200 benchmarks across multiple tasks, substantial efficiency gains, and further improvements from self-supervised masked token pre-training.
Significance. If the central claims hold, the work could enable scalable long-sequence fMRI modeling by sidestepping the memory limits of voxel-based approaches, opening the door to longer temporal windows and cross-domain transfer from natural-image pre-training. The efficiency and pre-training contributions would be practically useful for neuroimaging pipelines.
major comments (2)
- [§3] §3 (Methods, tokenization procedure): the claim that 2D natural-image AE tokens retain the spatiotemporal information needed for long-range dynamics modeling is load-bearing, yet the description does not specify how anisotropic resolution, inter-slice coherence, or brain-specific correlations are preserved when a 3D volume is fed to a 2D AE. If processing is effectively slice-wise, the mapping risks discarding depth structure, which would invalidate the reported gains over voxel baselines.
- [§5] §5 (Experiments and results): the abstract asserts outperformance and efficiency gains, but the provided text supplies no quantitative metrics, ablation tables, baseline details, error bars, or statistical tests. Without these, it is impossible to verify whether the efficiency/accuracy claims are supported or whether they survive controls for the 2D-to-3D mapping.
minor comments (1)
- The GitHub link for code is welcome; the repository should include exact preprocessing scripts, hyper-parameter settings, and the precise 2D AE checkpoint used so that the tokenization step can be reproduced.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important areas for clarification in the tokenization procedure and experimental reporting. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [§3] §3 (Methods, tokenization procedure): the claim that 2D natural-image AE tokens retain the spatiotemporal information needed for long-range dynamics modeling is load-bearing, yet the description does not specify how anisotropic resolution, inter-slice coherence, or brain-specific correlations are preserved when a 3D volume is fed to a 2D AE. If processing is effectively slice-wise, the mapping risks discarding depth structure, which would invalidate the reported gains over voxel baselines.
Authors: We thank the referee for this important observation. In TABLeT, each 3D fMRI volume is tokenized by applying the pre-trained 2D natural-image autoencoder independently to each axial slice. This slice-wise application is intentional to leverage the compact, semantically rich latent space of the 2D AE, which has been shown to generalize to medical images. Spatiotemporal information is preserved because: (1) the AE encodes local spatial structure within each slice, (2) the Transformer encoder then models long-range temporal dependencies across the sequence of tokenized volumes, and (3) inter-slice coherence emerges from the consistent feature extraction across adjacent slices combined with the transformer's global attention. We handle anisotropic resolution via standard resampling to isotropic spacing prior to tokenization. While we acknowledge that a purely slice-wise approach could theoretically lose some volumetric context, our experiments demonstrate that the resulting tokens retain sufficient information to outperform voxel-based baselines on multiple benchmarks. We will expand §3 with a detailed diagram of the tokenization pipeline, explicit discussion of these preservation mechanisms, and an additional ablation comparing slice-wise vs. 3D-aware variants. revision: partial
-
Referee: [§5] §5 (Experiments and results): the abstract asserts outperformance and efficiency gains, but the provided text supplies no quantitative metrics, ablation tables, baseline details, error bars, or statistical tests. Without these, it is impossible to verify whether the efficiency/accuracy claims are supported or whether they survive controls for the 2D-to-3D mapping.
Authors: We apologize if the quantitative results were not immediately visible in the version reviewed. The full manuscript in §5 contains multiple tables and figures reporting: (i) task performance (accuracy, AUC, etc.) on UKB, HCP, and ADHD-200 with direct comparisons to voxel-based baselines, (ii) memory and compute efficiency metrics showing substantial VRAM reductions, (iii) ablation studies on tokenization, masked pre-training, and the 2D-to-3D mapping, (iv) error bars from 5-fold cross-validation or repeated runs, and (v) statistical significance via paired t-tests with p-values. These results support the claims of outperformance and efficiency while controlling for the tokenization approach. We will revise §5 to add explicit in-text references to all tables/figures, include a consolidated summary table of key metrics, and ensure all controls for the 2D mapping are highlighted. revision: yes
Circularity Check
No significant circularity; claims rely on external pre-trained models and benchmark evaluation
full rationale
The paper describes TABLeT as tokenizing 3D fMRI volumes via a pre-trained 2D natural-image autoencoder, followed by a standard Transformer encoder and self-supervised masked token modeling. No equations, derivations, or fitted parameters are shown that reduce the reported performance gains (on UKB, HCP, ADHD-200) to quantities defined by construction from the same inputs. The approach depends on external pre-trained components and independent downstream benchmarks, so the central claims remain self-contained without self-definitional or fitted-input reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A 2D autoencoder trained on natural images can be applied slice-wise to 3D fMRI volumes to produce useful continuous tokens
Reference graph
Works this paper leans on
-
[1]
Gqa: Training generalized multi-query transformer models from multi-head checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. InProceedings of the Confer- ence on Empirical Methods in Natural Language Processing (EMNLP), pages 4895–4901, 2023. 4
work page 2023
-
[2]
Lindsay M Alexander, Jasmine Escalera, Lei Ai, Charissa Andreotti, Karina Febre, Alexander Mangone, Natan Vega- Potler, Nicolas Langer, Alexis Alexander, Meagan Kovacs, et al. An open resource for transdiagnostic research in pedi- atric mental health and learning disorders.Sci. Data., 4(1): 170181, 2017. 6, 13
work page 2017
-
[3]
Fidel Alfaro-Almagro, Mark Jenkinson, Neal K Bangerter, Jesper LR Andersson, Ludovica Griffanti, Gwena ¨elle Douaud, Stamatios N Sotiropoulos, Saad Jbabdi, Moises Hernandez-Fernandez, Emmanuel Vallee, et al. Image pro- cessing and quality control for the first 10,000 brain imaging datasets from uk biobank.Neuroimage, 166:400–424, 2018. 4
work page 2018
-
[4]
The neuro bureau adhd-200 preprocessed reposi- tory.Neuroimage, 144:275–286, 2017
Pierre Bellec, Carlton Chu, Francois Chouinard-Decorte, Yassine Benhajali, Daniel S Margulies, and R Cameron Craddock. The neuro bureau adhd-200 preprocessed reposi- tory.Neuroimage, 144:275–286, 2017. 4
work page 2017
-
[5]
Brainlm: A foundation model for brain activity recordings
Josue Ortega Caro, Antonio Henrique de Oliveira Fonseca, Syed A Rizvi, Matteo Rosati, Christopher Averill, James L Cross, Prateek Mittal, Emanuele Zappala, Rahul Madhav Dhodapkar, Chadi Abdallah, et al. Brainlm: A foundation model for brain activity recordings. InThe Twelfth Inter- national Conference on Learning Representations (ICLR),
-
[6]
Deep compres- sion autoencoder for efficient high-resolution diffusion mod- els
Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, and Song Han. Deep compres- sion autoencoder for efficient high-resolution diffusion mod- els. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025. 2, 3, 11, 13
work page 2025
-
[7]
Xgboost: A scalable tree boosting system
Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data mining (KDD), pages 785–794, 2016. 5
work page 2016
-
[8]
Zijian Dong, Ruilin Li, Yilei Wu, Thuan Tinh Nguyen, Joanna Chong, Fang Ji, Nathanael Tong, Christopher Chen, and Juan Helen Zhou. Brain-jepa: Brain dynamics foun- dation model with gradient positioning and spatiotemporal masking.Proceedings of the Advances in Neural Informa- tion Processing Systems (NeurIPS), 37:86048–86073, 2024. 1, 2, 5
work page 2024
-
[9]
fmriprep: a robust preprocessing pipeline for func- tional mri.Nat
Oscar Esteban, Christopher J Markiewicz, Ross W Blair, Craig A Moodie, A Ilkay Isik, Asier Erramuzpe, James D Kent, Mathias Goncalves, Elizabeth DuPre, Madeleine Sny- der, et al. fmriprep: a robust preprocessing pipeline for func- tional mri.Nat. Methods., 16(1):111–116, 2019. 4
work page 2019
-
[10]
Analysis of task-based functional mri data prepro- cessed with fmriprep.Nat
Oscar Esteban, Rastko Ciric, Karolina Finc, Ross W Blair, Christopher J Markiewicz, Craig A Moodie, James D Kent, Mathias Goncalves, Elizabeth DuPre, Daniel EP Gomez, et al. Analysis of task-based functional mri data prepro- cessed with fmriprep.Nat. Protoc., 15(7):2186–2202, 2020. 4
work page 2020
-
[11]
3d statistical neuroanatomical models from 305 mri volumes
Alan C Evans, D Louis Collins, SR Mills, Edward D Brown, Ryan L Kelly, and Terry M Peters. 3d statistical neuroanatomical models from 305 mri volumes. In1993 IEEE Conference Record Nuclear Science Symposium and Medical Imaging Conference (NSS MIC), pages 1813–1817,
-
[12]
Sex differences in default mode network connectivity in healthy aging adults
Bronte Ficek-Tani, Corey Horien, Suyeon Ju, Wanwan Xu, Nancy Li, Cheryl Lacadie, Xilin Shen, Dustin Scheinost, Todd Constable, and Carolyn Fredericks. Sex differences in default mode network connectivity in healthy aging adults. Cereb. Cortex., 33(10):6139–6151, 2023. 8
work page 2023
-
[13]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab- hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, 2022. 2
work page 2022
-
[15]
Xuan Kan, Wei Dai, Hejie Cui, Zilong Zhang, Ying Guo, and Carl Yang. Brain network transformer. InProceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 25586–25599, 2022. 1, 2, 4, 5
work page 2022
-
[16]
Jeremy Kawahara, Colin J Brown, Steven P Miller, Brian G Booth, Vann Chau, Ruth E Grunau, Jill G Zwicker, and Ghassan Hamarneh. Brainnetcnn: Convolutional neural net- works for brain networks; towards predicting neurodevelop- ment.Neuroimage, 146:1038–1049, 2017. 2, 5
work page 2017
-
[17]
Swift: Swin 4d fmri transformer
Peter Kim, Junbeom Kwon, Sunghwan Joo, Sangyoon Bae, Donggyu Lee, Yoonho Jung, Shinjae Yoo, Jiook Cha, and Taesup Moon. Swift: Swin 4d fmri transformer. InPro- ceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 42015–42037, 2023. 1, 2, 4, 5, 11
work page 2023
-
[18]
Self-supervised transformers for fmri representa- tion
Itzik Malkiel, Gony Rosenman, Lior Wolf, and Talma Hendler. Self-supervised transformers for fmri representa- tion. InInternational Conference on Medical Imaging with Deep Learning (MIDL), pages 895–913, 2022. 1, 2, 5
work page 2022
-
[19]
Multimodal population brain imaging in the uk biobank prospective epidemiological study.Nat
Karla L Miller, Fidel Alfaro-Almagro, Neal K Bangerter, David L Thomas, Essa Yacoub, Junqian Xu, Andreas J Bartsch, Saad Jbabdi, Stamatios N Sotiropoulos, Jesper LR Andersson, et al. Multimodal population brain imaging in the uk biobank prospective epidemiological study.Nat. Neu- rosci., 19(11):1523–1536, 2016. 4
work page 2016
-
[20]
Infraslow lfp cor- relates to resting-state fmri bold signals.Neuroimage, 74: 288–297, 2013
Wen-Ju Pan, Garth John Thompson, Matthew Evan Magnu- son, Dieter Jaeger, and Shella Keilholz. Infraslow lfp cor- relates to resting-state fmri bold signals.Neuroimage, 74: 288–297, 2013. 2
work page 2013
-
[21]
Pytorch: An imperative style, high-performance deep learning library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Process- ing Systems (NeurIPS), 2019. 4
work page 2019
-
[22]
Pavel Popov, Usman Mahmood, Zening Fu, Carl Yang, Vince Calhoun, and Sergey Plis. A simple but tough-to-beat baseline for fmri time-series classification.Neuroimage, 303: 120909, 2024. 1, 2, 5
work page 2024
-
[23]
Functional network organization of the hu- man brain.Neuron, 72(4):665–678, 2011
Jonathan D Power, Alexander L Cohen, Steven M Nelson, Gagan S Wig, Kelly Anne Barnes, Jessica A Church, Ale- cia C V ogel, Timothy O Laumann, Fran M Miezin, Bradley L Schlaggar, et al. Functional network organization of the hu- man brain.Neuron, 72(4):665–678, 2011. 1
work page 2011
-
[24]
Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Global waves synchronize the brain’s functional systems with fluc- tuating arousal.Sci
Ryan V Raut, Abraham Z Snyder, Anish Mitra, Dov Yellin, Naotaka Fujii, Rafael Malach, and Marcus E Raichle. Global waves synchronize the brain’s functional systems with fluc- tuating arousal.Sci. Adv., 7(30):eabf2709, 2021. 2
work page 2021
-
[26]
Srikanth Ryali, Yuan Zhang, Carlo de Los Angeles, Kaus- tubh Supekar, and Vinod Menon. Deep learning models re- veal replicable, generalizable, and behaviorally relevant sex differences in human functional brain organization.Proc. Natl. Acad. Sci. U.S.A., 121(9):e2310012121, 2024. 8
work page 2024
-
[27]
Sex dif- ferences in parietal lobe structure and development.Gend
Joel Salinas, Elizabeth D Mills, Amy L Conrad, Timothy Koscik, Nancy C Andreasen, and Peg Nopoulos. Sex dif- ferences in parietal lobe structure and development.Gend. Med., 9(1):44–55, 2012. 8
work page 2012
-
[28]
Alexander Schaefer, Ru Kong, Evan M Gordon, Timothy O Laumann, Xi-Nian Zuo, Avram J Holmes, Simon B Eick- hoff, and BT Thomas Yeo. Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity mri.Cereb. Cortex., 28(9):3095–3114, 2018. 5
work page 2018
-
[29]
Resting-state fmri in the hu- man connectome project.Neuroimage, 80:144–168, 2013
Stephen M Smith, Christian F Beckmann, Jesper Anders- son, Edward J Auerbach, Janine Bijsterbosch, Gwena ¨elle Douaud, Eugene Duff, David A Feinberg, Ludovica Grif- fanti, Michael P Harms, et al. Resting-state fmri in the hu- man connectome project.Neuroimage, 80:144–168, 2013. 4
work page 2013
-
[30]
Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,
-
[31]
Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, et al. Uk biobank: an open access re- source for identifying the causes of a wide range of complex diseases of middle and old age.PLoS Med., 12(3):e1001779,
-
[32]
Axiomatic attribution for deep networks
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. InProceedings of the Inter- national Conference on Machine Learning (ICML), pages 3319–3328, 2017. 8
work page 2017
-
[33]
Ye Tian, Daniel S Margulies, Michael Breakspear, and An- drew Zalesky. Topographic organization of the human sub- cortex unveiled with functional connectivity gradients.Nat. Neurosci., 23(11):1421–1432, 2020. 5
work page 2020
-
[34]
Videomae: Masked autoencoders are data-efficient learn- ers for self-supervised video pre-training
Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learn- ers for self-supervised video pre-training. InProceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 10078–10093, 2022. 2, 4
work page 2022
-
[35]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Il- lia Polosukhin. Attention is all you need. InProceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2017. 2, 4
work page 2017
-
[36]
Sex classification by resting state brain connectivity.Cereb
Susanne Weis, Kaustubh R Patil, Felix Hoffstaedter, Alessandra Nostro, BT Yeo, and Simon B Eickhoff. Sex classification by resting state brain connectivity.Cereb. Cor- tex., 30(2):824–835, 2020. 8
work page 2020
-
[37]
Simmim: A simple framework for masked image modeling
Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition (CVPR), pages 9653–9663, 2022. 2, 4 Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Mod...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.