pith. sign in

arxiv: 1907.11885 · v1 · pith:I4MKHVVUnew · submitted 2019-07-27 · 🧬 q-bio.NC · cs.CV

Effective and efficient ROI-wise visual encoding using an end-to-end CNN regression model and selective optimization

Pith reviewed 2026-05-24 15:01 UTC · model grok-4.3

classification 🧬 q-bio.NC cs.CV
keywords visual encodingfMRICNN regressionend-to-end modelROI-wise encodingbrain activity predictionselective optimization
0
0 comments X

The pith

An end-to-end CNN regression model in ROI-wise manner predicts fMRI responses to images more accurately than the standard two-step encoding approach.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes replacing the common two-step visual encoding process—extracting features from a pre-trained CNN then fitting linear regressions per voxel—with a single end-to-end CNN regression model applied to regions of interest. This end-to-end training lets the model discover feature representations that better match the fMRI data without prior assumptions about which layers or features will work. Selective optimization techniques, including adaptive weighting and noise regularization, further help by focusing on effective voxels within each ROI. Experiments show higher prediction accuracy, suggesting that joint optimization of feature learning and mapping can outperform the separated pipeline for brain activity prediction.

Core claim

The end-to-end convolution regression model (ETECRM) performs visual encoding by training a CNN to directly regress fMRI signals in an ROI-wise manner, using selective optimization with self-adapting weights, weighted correlation loss, and noise regularization; this yields better predicting accuracy than two-step models that first extract features from pre-trained networks and then apply linear regression per voxel.

What carries the argument

The end-to-end convolution regression model (ETECRM) that integrates feature extraction and regression in one trainable network for ROI-wise encoding.

If this is right

  • The model achieves higher accuracy in predicting brain responses to visual stimuli.
  • ROI-wise processing improves efficiency when encoding many voxels.
  • Automatic feature learning addresses the limitation of unknown optimal matches for fMRI data.
  • Selective optimization reduces interference from ineffective voxels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the end-to-end approach scales with larger datasets, it could shift visual encoding toward fully learned models rather than relying on computer vision pre-training.
  • Similar joint optimization might apply to encoding in other sensory modalities or with different imaging techniques.
  • Future work could test whether the learned features align with known properties of visual cortex.

Load-bearing premise

That jointly optimizing a CNN for both feature extraction and fMRI prediction will find better matches than using fixed pre-trained features.

What would settle it

A head-to-head comparison on the same fMRI dataset where the two-step model matches or exceeds the end-to-end model's prediction accuracy after equivalent tuning.

Figures

Figures reproduced from arXiv: 1907.11885 by Bin Yan, Chi Zhang, Jian Chen, Kai Qiao, Linyuan Wang, Li Tong.

Figure 1
Figure 1. Figure 1: The proposed method including end-to-end learning and ROI-wise encoding. a. Three spaces and two mapping are included in the linearizing encoding manner. b. Two-step manner of visual encoding including nonlinear feature transformation with fixed parameter and linear [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
read the original abstract

Recently, visual encoding based on functional magnetic resonance imaging (fMRI) have realized many achievements with the rapid development of deep network computation. Visual encoding model is aimed at predicting brain activity in response to presented image stimuli. Currently, visual encoding is accomplished mainly by firstly extracting image features through convolutional neural network (CNN) model pre-trained on computer vision task, and secondly training a linear regression model to map specific layer of CNN features to each voxel, namely voxel-wise encoding. However, the two-step manner model, essentially, is hard to determine which kind of well features are well linearly matched for beforehand unknown fMRI data with little understanding of human visual representation. Analogizing computer vision mostly related human vision, we proposed the end-to-end convolution regression model (ETECRM) in the region of interest (ROI)-wise manner to accomplish effective and efficient visual encoding. The end-to-end manner was introduced to make the model automatically learn better matching features to improve encoding performance. The ROI-wise manner was used to improve the encoding efficiency for many voxels. In addition, we designed the selective optimization including self-adapting weight learning and weighted correlation loss, noise regularization to avoid interfering of ineffective voxels in ROI-wise encoding. Experiment demonstrated that the proposed model obtained better predicting accuracy than the two-step manner of encoding models. Comparative analysis implied that end-to-end manner and large volume of fMRI data may drive the future development of visual encoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes an end-to-end CNN regression model (ETECRM) for ROI-wise visual encoding of fMRI responses to image stimuli. It contrasts this with the conventional two-step pipeline (pre-trained CNN feature extraction followed by voxel-wise linear regression), arguing that joint optimization allows automatic discovery of better-matched features. The model incorporates selective optimization via self-adapting weights, weighted correlation loss, and noise regularization to mitigate ineffective voxels. Experiments are reported to show higher predicting accuracy than two-step baselines, with the abstract suggesting that end-to-end training and large fMRI datasets will drive future progress.

Significance. If the accuracy gains can be robustly attributed to end-to-end training after proper controls, the work would provide evidence that joint feature learning improves encoding performance over separate extraction and regression steps, with potential implications for modeling human visual representations.

major comments (2)
  1. [Abstract, Experiments] Abstract and Experiments: The central claim attributes improved accuracy to the end-to-end manner automatically learning better matching features. However, the model simultaneously replaces voxel-wise with ROI-wise encoding and introduces selective optimization (self-adapting weight learning, weighted correlation loss, noise regularization). No ablation isolates the end-to-end component, and the two-step baseline is not shown to receive equivalent ROI-wise treatment or selective optimization, so the accuracy gain cannot be credited to end-to-end learning.
  2. [Experiments] Experiments: Performance is evaluated on the fMRI data used for training the model itself, with no mention of held-out test sets, cross-validation, or independent validation benchmarks. This circular evaluation prevents assessment of generalization to beforehand unknown fMRI data and undermines the accuracy claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and outline planned revisions.

read point-by-point responses
  1. Referee: [Abstract, Experiments] Abstract and Experiments: The central claim attributes improved accuracy to the end-to-end manner automatically learning better matching features. However, the model simultaneously replaces voxel-wise with ROI-wise encoding and introduces selective optimization (self-adapting weight learning, weighted correlation loss, noise regularization). No ablation isolates the end-to-end component, and the two-step baseline is not shown to receive equivalent ROI-wise treatment or selective optimization, so the accuracy gain cannot be credited to end-to-end learning.

    Authors: We agree that the current experiments do not isolate the contribution of end-to-end training from the ROI-wise formulation and selective optimization components. The two-step baseline follows the conventional voxel-wise approach in the literature, while ROI-wise encoding and selective optimization were introduced to enable efficient handling of multiple voxels and to mitigate ineffective ones. The end-to-end design permits joint optimization of features for the encoding objective, which is not possible in the separate two-step pipeline. To address the concern, we will add ablation studies in the revision that apply equivalent ROI-wise treatment and selective optimization to the two-step baseline, as well as remove selective optimization from the proposed model, to better attribute performance differences. revision: yes

  2. Referee: [Experiments] Experiments: Performance is evaluated on the fMRI data used for training the model itself, with no mention of held-out test sets, cross-validation, or independent validation benchmarks. This circular evaluation prevents assessment of generalization to beforehand unknown fMRI data and undermines the accuracy claim.

    Authors: The referee correctly notes that the reported experiments do not describe held-out test sets or cross-validation. We will revise the Experiments section to incorporate k-fold cross-validation (or equivalent held-out splits) on the fMRI data to evaluate generalization performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical comparison stands on experimental results

full rationale

The paper is an empirical ML study comparing an end-to-end ROI-wise CNN regression model against two-step voxel-wise baselines on fMRI data. Its central claim rests on reported accuracy numbers from experiments rather than any mathematical derivation, uniqueness theorem, or self-referential definition that reduces the output to the input by construction. No equations are presented whose predictions are forced by fitted parameters or self-citations; the selective optimization components and ROI-wise change are explicit design choices whose effects are measured, not smuggled in. Even if train/test splits are imperfect, that is a methodological limitation, not a circular reduction of the claimed result to its own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Abstract-only review yields limited ledger entries; the approach rests on standard CNN training assumptions and data-driven fitting without explicit numerical free parameters or new entities listed.

free parameters (2)
  • CNN regression weights
    Learned end-to-end from fMRI data to map image features to brain activity
  • self-adapting weights
    Introduced in selective optimization to handle ineffective voxels
axioms (2)
  • domain assumption End-to-end training automatically learns better matching features than pre-trained CNN layers for fMRI data
    Invoked to justify the ETECRM proposal over two-step models
  • domain assumption ROI-wise encoding improves efficiency without losing accuracy compared to voxel-wise encoding
    Used to motivate grouping voxels by region

pith-pipeline@v0.9.0 · 5803 in / 1225 out tokens · 36963 ms · 2026-05-24T15:01:59.620493+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

  1. [1]

    Brain magnetic resonance imaging with contrast dependent on blood oxygenation [J]

    Ogawa S, Lee T-M, Kay A R, et al. Brain magnetic resonance imaging with contrast dependent on blood oxygenation [J]. Proceedings of the National Academy of Sciences, 1990, 87(24): 9868-72

  2. [2]

    Predicting human brain activity associated with the meanings of nouns [J]

    Mitchell T M, Shinkareva S V, Carlson A, et al. Predicting human brain activity associated with the meanings of nouns [J]. science, 2008, 320(5880): 1191-5

  3. [3]

    Encoding and decoding in fMRI [J]

    Naselaris T, Kay K N, Nishimoto S, et al. Encoding and decoding in fMRI [J]. Neuroimage, 2011, 56(2): 400-10

  4. [4]

    A continuous semantic space describes the representation of thousands of object and action categories across the human brain [J]

    Huth A G, Nishimoto S, Vu A T, et al. A continuous semantic space describes the representation of thousands of object and action categories across the human brain [J]. Neuron, 2012, 76(6): 1210-24

  5. [5]

    Deep hierarchies in the primate visual cortex: What can we learn for computer vision? [J]

    Kruger N, Janssen P, Kalkan S, et al. Deep hierarchies in the primate visual cortex: What can we learn for computer vision? [J]. IEEE transactions on pattern analysis and machine intelligence, 2013, 35(8): 1847-71

  6. [6]

    Distributed and overlapping representations of faces and objects in ventral temporal cortex [J]

    Haxby J V, Gobbini M I, Furey M L, et al. Distributed and overlapping representations of faces and objects in ventral temporal cortex [J]. Science, 2001, 293(5539): 2425 -30

  7. [7]

    Decoding of visual information from human brain activity: A review of fMRI and EEG studies [J]

    Zafar R, Malik A S, Kamel N, et al. Decoding of visual information from human brain activity: A review of fMRI and EEG studies [J]. Journal of integrative neuroscience, 2015, 14(02): 155 -68

  8. [8]

    Generic decoding of seen and imagined objects using hierarchical visual features [J]

    Horikawa T, Kamitani Y. Generic decoding of seen and imagined objects using hierarchical visual features [J]. Nature communications, 2017, 8(15037

  9. [9]

    Decoding natural images from evoked brain activities using encoding models with invertible mapping [J]

    Li C, Xu J, Liu B. Decoding natural images from evoked brain activities using encoding models with invertible mapping [J]. Neural Networks, 2018, 105(227-35

  10. [10]

    Qiao K, Chen J, Wang L, et al. Category decoding of visual stimuli from huma n brain activity using a bidirectional recurrent neural network to simulate bidirectional information flows in human visual cortices [J]. arXiv preprint arXiv:190307783, 2019,

  11. [11]

    A real -time fMRI-based spelling dev ice immediately enabling robust motor-independent communication [J]

    Sorger B, Reithler J, Dahmen B, et al. A real -time fMRI-based spelling dev ice immediately enabling robust motor-independent communication [J]. Current Biology, 2012, 22(14): 1333 -8. 18

  12. [12]

    A visual encoding model based on deep neural networks and transfer learning for brain activity measured by func tional magnetic resonance imaging [J]

    Zhang C, Qiao K, Wang L, et al. A visual encoding model based on deep neural networks and transfer learning for brain activity measured by func tional magnetic resonance imaging [J]. Journal of Neuroscience Methods, 2019, 108318

  13. [13]

    Representation learning: A review and new perspectives [J]

    Bengio Y, Courville A, Vincent P. Representation learning: A review and new perspectives [J]. IEEE transactions on pattern analysis and machine intelligence, 2013, 35(8): 1798-828

  14. [14]

    Overfeat: Integrated recognition, localization and detection using convolutional networks [J]

    Sermanet P, Eigen D, Zhang X, et al. Overfeat: Integrated recognition, localization and detection using convolutional networks [J]. arXiv preprint arXiv:13126229, 2013,

  15. [15]

    Deep residual learning for imag e recognition; proceedings of the Proceedings of the IEEE conference on computer vision and pattern recognition, F, 2016 [C]

    He K, Zhang X, Ren S, et al. Deep residual learning for imag e recognition; proceedings of the Proceedings of the IEEE conference on computer vision and pattern recognition, F, 2016 [C]

  16. [16]

    A discriminatively trained, multiscale, deformable part model; proceedings of the Co mputer Vision and Pattern Recognition, 2008 CVPR 2008 IEEE Conference on, F, 2008 [C]

    Felzenszwalb P, McAllester D, Ramanan D. A discriminatively trained, multiscale, deformable part model; proceedings of the Co mputer Vision and Pattern Recognition, 2008 CVPR 2008 IEEE Conference on, F, 2008 [C]. IEEE

  17. [17]

    Coding facial expressions with gabor wavelets; proceedings of the Automatic Face and Gesture Recognition, 1998 Proceedings Third IEEE International Conference on, F, 1998 [C]

    Lyons M, Akamatsu S, Kamachi M, et al. Coding facial expressions with gabor wavelets; proceedings of the Automatic Face and Gesture Recognition, 1998 Proceedings Third IEEE International Conference on, F, 1998 [C]. IEEE

  18. [18]

    Histograms of oriented gradients for human detection; proceedings of the Computer Vision and Pattern Recognition, 2005 CVPR 2005 IEEE Computer Society Conference on, F, 2005 [C]

    Dalal N, Triggs B. Histograms of oriented gradients for human detection; proceedings of the Computer Vision and Pattern Recognition, 2005 CVPR 2005 IEEE Computer Society Conference on, F, 2005 [C]. IEEE

  19. [19]

    Face description with local binary patterns: Application to face recognition [J]

    Ahonen T, Hadid A, Pietikainen M. Face description with local binary patterns: Application to face recognition [J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2006, 12): 2037 -41

  20. [20]

    SIFT: Predicting amino acid changes that affect protein function [J]

    Ng P C, Henikoff S. SIFT: Predicting amino acid changes that affect protein function [J]. Nucleic acids research, 2003, 31(13): 3812-4

  21. [21]

    Identifying natural images from human brain activity [J]

    Kay K N, Naselaris T, Prenger R J, et al. Identifying natural images from human brain activity [J]. Nature, 2008, 452(7185): 352

  22. [22]

    ImageNet classification with deep convolutional neural networks; proceedings of the International Conference on Neural Information Processing Systems, F, 2012 [C]

    Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks; proceedings of the International Conference on Neural Information Processing Systems, F, 2012 [C]

  23. [23]

    ImageNet Large Scale Vis ual Recognition Challenge [J]

    Russakovsky O, Deng J, Su H, et al. ImageNet Large Scale Vis ual Recognition Challenge [J]. International Journal of Computer Vision, 2015, 115(3): 211-52

  24. [24]

    Deep learning [J]

    LeCun Y, Bengio Y, Hinton G. Deep learning [J]. nature, 2015, 521(7553): 436

  25. [25]

    Deep learning [M]

    Goodfellow I, Bengio Y, Courville A, et al. Deep learning [M]. MIT press Cambridge, 2016

  26. [26]

    Pixels to voxels: modeling visual representation in the human brain [J]

    Agrawal P, Stansbury D, Malik J, et al. Pixels to voxels: modeling visual representation in the human brain [J]. arXiv preprint arXiv:14075104, 2014,

  27. [27]

    Performance -optimized hierarchical mode ls predict neural responses in higher visual cortex [J]

    Yamins D L, Hong H, Cadieu C F, et al. Performance -optimized hierarchical mode ls predict neural responses in higher visual cortex [J]. Proceedings of the National Academy of Sciences, 2014, 111(23): 8619-24

  28. [28]

    Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream [J]

    Gü ç lü U, van Gerven M A. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream [J]. Journal of Neuroscience, 2015, 35(27): 10005-14

  29. [29]

    Seeing it all: Convolutional network layers map the function of the human visual system [J]

    Eickenberg M, Gramfort A, Varoquaux G, et al. Seeing it all: Convolutional network layers map the function of the human visual system [J]. Neuroimage, 2016, 152(

  30. [30]

    The feature -weighted receptive field: an interpretable encoding model for complex feature spaces [J]

    Styves G, Naselaris T. The feature -weighted receptive field: an interpretable encoding model for complex feature spaces [J]. Neuroimage, 2017,

  31. [31]

    Deep Residual Network Predicts Cortical Representation and Organization of Visual Features for Rapid Categorization [J]

    Wen H, Shi J, Chen W, et al. Deep Residual Network Predicts Cortical Representation and Organization of Visual Features for Rapid Categorization [J]. Scientific reports, 2018, 8(1): 3752

  32. [32]

    Deep recurrent neural network reveals a hierarchy of process memory during dynamic natural vision [J]

    Shi J, Wen H, Zhang Y, et al. Deep recurrent neural network reveals a hierarchy of process memory during dynamic natural vision [J]. Human brain mapping, 2018, 39(5): 226 9-82

  33. [33]

    Variational autoencoder: An unsupervised model for modeling and decoding fMRI activity in visual cortex [J]

    Han K, Wen H, Shi J, et al. Variational autoencoder: An unsupervised model for modeling and decoding fMRI activity in visual cortex [J]. bioRxiv, 2017, 214247

  34. [34]

    Accurate reconstruction of image stimuli from hu man functional magnetic resonance imaging based on the decoding model with capsule network architecture [J]

    Qiao K, Zhang C, Wang L, et al. Accurate reconstruction of image stimuli from hu man functional magnetic resonance imaging based on the decoding model with capsule network architecture [J]. Frontiers in neuroinformatics, 2018, 12

  35. [35]

    BOLD5000, a public fMRI dataset while viewing 5000 visual images [J]

    Chang N, Pyles J A, Marcus A, et al. BOLD5000, a public fMRI dataset while viewing 5000 visual images [J]. Scientific data, 2019, 6(1): 49

  36. [36]

    Bayesian Reconstruction of Natural Images from Human Brain Activity: Neuron [J]

    Naselaris T, Prenger R J, Kay K N, et al. Bayesian Reconstruction of Natural Images from Human Brain Activity: Neuron [J]. Neuron, 2009, 63(6): 902-15

  37. [37]

    A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, F, 2001 [C]

    Martin D, Fowlkes C, Tal D, et al. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, F, 2001 [C]. Iccv Vancouver:

  38. [38]

    Gradient -based learning applied to document recognition [J]

    LeCun Y, Bottou L, Bengio Y, et al. Gradient -based learning applied to document recognition [J]. Proceedings of the IEEE, 1998, 86(11): 2278-324

  39. [39]

    Introduction to pytorch [M]

    Ketkar N. Introduction to pytorch [M]. Deep Learning with Python. Springer. 2017: 195 -208. 19

  40. [40]

    Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit [J]

    Needell D, Vershynin R. Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit [J]. IEEE Journal of selected topics in signal processing, 2010, 4(2): 310-6