Effective and efficient ROI-wise visual encoding using an end-to-end CNN regression model and selective optimization
Pith reviewed 2026-05-24 15:01 UTC · model grok-4.3
The pith
An end-to-end CNN regression model in ROI-wise manner predicts fMRI responses to images more accurately than the standard two-step encoding approach.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The end-to-end convolution regression model (ETECRM) performs visual encoding by training a CNN to regress fMRI signals directly, one region of interest (ROI) at a time. Training uses selective optimization, which combines self-adapting weights, a weighted correlation loss, and noise regularization. The paper reports that this yields better prediction accuracy than two-step models, which first extract features from pre-trained networks and then fit a linear regression per voxel.
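As a rough illustration of what a weighted correlation loss with self-adapting voxel weights might look like, here is a minimal Python sketch. The paper's exact formulation is not reproduced on this page, so the softmax weighting, the function names, and the `voxel_logits` parameter are all assumptions made for illustration; noise regularization is left out entirely.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def softmax(logits):
    """Turn unconstrained scores into positive weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_correlation_loss(pred, target, voxel_logits):
    """Negative weighted mean of per-voxel correlations.

    pred, target: one list of responses per voxel (voxels x images).
    voxel_logits: hypothetical learnable score per voxel (an assumption);
    voxels with low scores contribute little to the loss.
    """
    weights = softmax(voxel_logits)
    corrs = [pearson(p, t) for p, t in zip(pred, target)]
    return -sum(w * r for w, r in zip(weights, corrs))
```

In a real model the logits would be trained jointly with the network, so that voxels the model cannot predict are gradually down-weighted; that training loop is not shown here.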
What carries the argument
The end-to-end convolution regression model (ETECRM) that integrates feature extraction and regression in one trainable network for ROI-wise encoding.
If this is right
- The model achieves higher accuracy in predicting brain responses to visual stimuli.
- ROI-wise processing improves efficiency when encoding many voxels.
- Automatic feature learning removes the need to know in advance which pre-trained features map linearly onto the fMRI data.
- Selective optimization reduces interference from ineffective voxels.
Where Pith is reading between the lines
- If the end-to-end approach scales with larger datasets, it could shift visual encoding toward fully learned models rather than relying on computer vision pre-training.
- Similar joint optimization might apply to encoding in other sensory modalities or with different imaging techniques.
- Future work could test whether the learned features align with known properties of visual cortex.
Load-bearing premise
That jointly optimizing a CNN for both feature extraction and fMRI prediction will find better matches than using fixed pre-trained features.
What would settle it
A head-to-head comparison on the same fMRI dataset where the two-step model matches or exceeds the end-to-end model's prediction accuracy after equivalent tuning.
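One simple way to summarize such a head-to-head comparison is a paired, per-voxel tally. The sketch below assumes each model has already produced one accuracy score per voxel (for example, a correlation on held-out images) for the same voxels in the same order; the function name and the returned keys are illustrative, not taken from the paper.

```python
def head_to_head(acc_a, acc_b):
    """Paired comparison of per-voxel accuracies from two encoding models.

    acc_a, acc_b: accuracy per voxel, same voxels in the same order.
    Returns the mean paired difference (a minus b) and win/loss/tie counts.
    """
    if len(acc_a) != len(acc_b) or not acc_a:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    return {
        "mean_diff": sum(diffs) / len(diffs),
        "a_wins": sum(d > 0 for d in diffs),
        "b_wins": sum(d < 0 for d in diffs),
        "ties": sum(d == 0 for d in diffs),
    }
```

A tally like this only describes the outcome; deciding whether a difference is reliable would still need a significance test and equivalent tuning effort for both models.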
Original abstract
Recently, visual encoding based on functional magnetic resonance imaging (fMRI) has made considerable progress with the rapid development of deep network computation. A visual encoding model aims to predict brain activity in response to presented image stimuli. Currently, visual encoding is accomplished mainly by first extracting image features through a convolutional neural network (CNN) model pre-trained on a computer vision task, and then training a linear regression model to map a specific layer of CNN features to each voxel, namely voxel-wise encoding. However, with the two-step model it is hard to determine in advance which features will match previously unknown fMRI data well under a linear mapping, given how little is understood about human visual representation. Drawing on the analogy between computer vision and human vision, we proposed the end-to-end convolution regression model (ETECRM) in the region of interest (ROI)-wise manner to accomplish effective and efficient visual encoding. The end-to-end manner was introduced to make the model automatically learn better-matching features and so improve encoding performance. The ROI-wise manner was used to improve the encoding efficiency for many voxels. In addition, we designed selective optimization, including self-adapting weight learning, weighted correlation loss, and noise regularization, to avoid interference from ineffective voxels in ROI-wise encoding. Experiments demonstrated that the proposed model obtained better prediction accuracy than the two-step encoding models. Comparative analysis implied that the end-to-end manner and large volumes of fMRI data may drive the future development of visual encoding.
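The second step of the two-step pipeline described in the abstract (a separate linear map from fixed features to each voxel) can be sketched in plain Python. This is a simplified illustration under stated assumptions: the features are taken as already extracted by some pre-trained network, ridge regularization (`lam`) is an optional addition not specified in the abstract, there is no intercept term, and all function names are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ridge(features, responses, lam=0.0):
    """Fit one voxel: minimize ||F w - y||^2 + lam * ||w||^2."""
    d = len(features[0])
    gram = [[sum(f[i] * f[j] for f in features) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(f[i] * y for f, y in zip(features, responses))
           for i in range(d)]
    return solve(gram, rhs)


def fit_voxelwise(features, voxel_responses, lam=0.0):
    """Voxel-wise encoding: an independent linear model for every voxel."""
    return [fit_ridge(features, y, lam) for y in voxel_responses]


def predict(features, weights):
    """Predicted response of one voxel for each image's feature vector."""
    return [sum(w * x for w, x in zip(weights, f)) for f in features]
```

Because the features never change during fitting, every voxel is solved independently; the end-to-end approach differs in that the feature extractor itself is updated by the encoding objective.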
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an end-to-end CNN regression model (ETECRM) for ROI-wise visual encoding of fMRI responses to image stimuli. It contrasts this with the conventional two-step pipeline (pre-trained CNN feature extraction followed by voxel-wise linear regression), arguing that joint optimization allows automatic discovery of better-matched features. The model incorporates selective optimization via self-adapting weights, weighted correlation loss, and noise regularization to mitigate ineffective voxels. Experiments are reported to show higher prediction accuracy than two-step baselines, with the abstract suggesting that end-to-end training and large fMRI datasets will drive future progress.
Significance. If the accuracy gains can be robustly attributed to end-to-end training after proper controls, the work would provide evidence that joint feature learning improves encoding performance over separate extraction and regression steps, with potential implications for modeling human visual representations.
major comments (2)
- [Abstract, Experiments] Abstract and Experiments: The central claim attributes improved accuracy to the end-to-end manner automatically learning better matching features. However, the model simultaneously replaces voxel-wise with ROI-wise encoding and introduces selective optimization (self-adapting weight learning, weighted correlation loss, noise regularization). No ablation isolates the end-to-end component, and the two-step baseline is not shown to receive equivalent ROI-wise treatment or selective optimization, so the accuracy gain cannot be credited to end-to-end learning.
- [Experiments] Experiments: Performance is evaluated on the fMRI data used for training the model itself, with no mention of held-out test sets, cross-validation, or independent validation benchmarks. This circular evaluation prevents assessment of generalization to previously unseen fMRI data and undermines the accuracy claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and outline planned revisions.
Referee: [Abstract, Experiments] Abstract and Experiments: The central claim attributes improved accuracy to the end-to-end manner automatically learning better matching features. However, the model simultaneously replaces voxel-wise with ROI-wise encoding and introduces selective optimization (self-adapting weight learning, weighted correlation loss, noise regularization). No ablation isolates the end-to-end component, and the two-step baseline is not shown to receive equivalent ROI-wise treatment or selective optimization, so the accuracy gain cannot be credited to end-to-end learning.
Authors: We agree that the current experiments do not isolate the contribution of end-to-end training from the ROI-wise formulation and selective optimization components. The two-step baseline follows the conventional voxel-wise approach in the literature, while ROI-wise encoding and selective optimization were introduced to enable efficient handling of multiple voxels and to mitigate ineffective ones. The end-to-end design permits joint optimization of features for the encoding objective, which is not possible in the separate two-step pipeline. To address the concern, we will add ablation studies in the revision that apply equivalent ROI-wise treatment and selective optimization to the two-step baseline, as well as remove selective optimization from the proposed model, to better attribute performance differences. revision: yes
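The ablation design discussed in this exchange can be laid out as a small factorial grid. The sketch below is only an illustration: the three factor names and their levels are labels chosen here to mirror the referee's comment, and a full 2 x 2 x 2 grid is an assumption, since some cells (for example, selective optimization with voxel-wise fitting) may not be meaningful in practice.

```python
from itertools import product

# Hypothetical factor labels mirroring the three changes the referee lists.
FACTORS = {
    "training": ("two_step", "end_to_end"),
    "granularity": ("voxel_wise", "roi_wise"),
    "selective_optimization": (False, True),
}


def ablation_grid(factors=FACTORS):
    """Every combination of factor levels, one dict per configuration."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def isolating_pairs(grid, factor):
    """Pairs of configurations that differ in `factor` and nothing else."""
    pairs = []
    for i, a in enumerate(grid):
        for b in grid[i + 1:]:
            # Equal on every key except `factor`, and different on `factor`.
            if all((a[k] == b[k]) != (k == factor) for k in a):
                pairs.append((a, b))
    return pairs
```

Calling `isolating_pairs(ablation_grid(), "training")` lists the comparisons in which only the training scheme changes, which is the contrast needed to credit a gain to end-to-end learning.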
Referee: [Experiments] Experiments: Performance is evaluated on the fMRI data used for training the model itself, with no mention of held-out test sets, cross-validation, or independent validation benchmarks. This circular evaluation prevents assessment of generalization to previously unseen fMRI data and undermines the accuracy claim.
Authors: The referee correctly notes that the reported experiments do not describe held-out test sets or cross-validation. We will revise the Experiments section to incorporate k-fold cross-validation (or equivalent held-out splits) on the fMRI data to evaluate generalization performance. revision: yes
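The k-fold cross-validation mentioned in this response amounts to partitioning the stimulus indices so that every sample is held out exactly once. A minimal sketch, assuming one index per stimulus image; the function name, the `seed` parameter, and the round-robin fold assignment are choices made here, not details from the paper.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Indices are shuffled once with a fixed seed, then dealt into k folds;
    each fold serves as the held-out test set exactly once.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i
                       for idx in fold)
        yield train, test
```

With fMRI data it is usually safer to split by stimulus image rather than by trial, so that repeated presentations of the same image never land on both sides of a split.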
Circularity Check
No significant circularity; empirical comparison stands on experimental results
The paper is an empirical ML study comparing an end-to-end ROI-wise CNN regression model against two-step voxel-wise baselines on fMRI data. Its central claim rests on reported accuracy numbers from experiments rather than any mathematical derivation, uniqueness theorem, or self-referential definition that reduces the output to the input by construction. No equations are presented whose predictions are forced by fitted parameters or self-citations; the selective optimization components and ROI-wise change are explicit design choices whose effects are measured, not smuggled in. Even if train/test splits are imperfect, that is a methodological limitation, not a circular reduction of the claimed result to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- CNN regression weights
- self-adapting weights
axioms (2)
- domain assumption: End-to-end training automatically learns better matching features than pre-trained CNN layers for fMRI data
- domain assumption: ROI-wise encoding improves efficiency without losing accuracy compared to voxel-wise encoding