Learning to aggregate feature representations
Pith reviewed 2026-05-25 11:45 UTC · model grok-4.3
The pith
Learning aggregation weights over the stages of a pretrained image network produces a flexible encoder for brain activity from images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training a set of aggregation weights on the five-stage outputs of an SE-ResNeXt-50 network allows the same backbone to serve as an effective image-to-brain encoder for multiple subjects, regions of interest, and recording modalities; the optimized weights consistently favor the later stages for both early visual cortex and higher-order inferotemporal cortex.
What carries the argument
The learned aggregation weights that modulate and screen the feature maps produced at each of the five successive stages of the fixed classification network.
If this is right
- The identical aggregation procedure works for both fMRI and MEG measurements without modality-specific changes.
- The same weights support prediction across multiple subjects and across both early and late visual regions of interest.
- Later network stages receive higher aggregation weight than earlier ones even when the target region is early visual cortex.
- The approach requires no retraining of the backbone network itself.
Where Pith is reading between the lines
- If later stages dominate, brain encoding models may benefit more from features that pool over large spatial extents than from the locally tuned early filters.
- Replacing the backbone with a different pretrained network would test whether the preference for later stages is architecture-specific or general.
- The same aggregation logic could be applied to other hierarchical sensory models if the goal is to predict responses from limited paired data.
Load-bearing premise
The features already present in the intermediate layers of an ImageNet-pretrained classification network are sufficient to explain brain responses once the right combination weights are found.
What would settle it
A fixed uniform average of the same five stages, or an aggregation that uses only the earliest stages, produces equal or higher prediction accuracy on the same multi-subject, multi-modality test data.
Figures
read the original abstract
The Algonauts challenge requires to construct a multi-subject encoder of images to brain activity. Deep networks such as ResNet-50 and AlexNet trained for image classification are known to produce feature representations along their intermediate stages which closely mimic the visual hierarchy. However the challenges introduced in the Algonauts project, including combining data from multiple subjects, relying on very few similarity data points, solving for various ROIs, and multi-modality, require devising a flexible framework which can efficiently accommodate them. Here we build upon a recent state-of-the-art classification network (SE-ResNeXt-50) and construct an adaptive combination of its intermediate representations. While the pretrained network serves as a backbone of our model, we learn how to aggregate feature representations along five stages of the network. During learning, our method enables to modulate and screen outputs from each stage along the network as governed by the optimized objective. We applied our method to the Algonauts2019 fMRI and MEG challenges. Using the combined fMRI and MEG data, our approach was rated among the leading five for both challenges. Surprisingly we find that for both the lower and higher order areas (EVC and IT) the adaptive aggregation favors features stemming at later stages of the network.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes learning adaptive aggregation weights over the five intermediate stages of a fixed, ImageNet-pretrained SE-ResNeXt-50 backbone to predict multi-subject fMRI and MEG responses in the Algonauts 2019 challenge. The method modulates and screens stage outputs during optimization against the challenge similarity data and reports a top-5 ranking for both modalities. The central empirical claim is that the learned weights assign higher importance to later stages for both EVC and IT.
Significance. If the stage-preference result survives controls for feature magnitude and optimization bias, the finding would be noteworthy because it suggests that later-stage features from a classification network remain useful even for early visual areas, contrary to the usual expectation that EVC should be best captured by early layers. The approach supplies a compact, learnable aggregation mechanism that accommodates multi-subject and multi-modality data with few similarity pairs. No machine-checked proofs or parameter-free derivations are present; the contribution is empirical and challenge-oriented.
major comments (2)
- [Abstract] Abstract: the claim that 'the adaptive aggregation favors features stemming at later stages of the network' for both EVC and IT is load-bearing for the headline result, yet the text supplies neither the learned weight vectors, their standard errors across subjects or folds, nor any ablation that normalizes per-stage activation norms before aggregation. Without such controls it is impossible to distinguish genuine feature utility from a simple magnitude bias, since later stages of SE-ResNeXt-50 typically have larger channel counts and activation scales.
- [Abstract] The manuscript states that weights are 'learned against the challenge data' but provides no description of the aggregation operator (linear, gated, or otherwise), the loss, regularization on the weights, or any test that the small number of similarity pairs does not produce degenerate or unstable solutions. This directly affects whether the reported later-stage preference can be interpreted as reflecting brain-feature alignment rather than fitting artifacts.
minor comments (1)
- [Abstract] The abstract refers to 'five stages of the network' without defining the exact layer boundaries or residual-block groupings used for SE-ResNeXt-50; a table or figure listing the stages would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate the revisions we will make.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'the adaptive aggregation favors features stemming at later stages of the network' for both EVC and IT is load-bearing for the headline result, yet the text supplies neither the learned weight vectors, their standard errors across subjects or folds, nor any ablation that normalizes per-stage activation norms before aggregation. Without such controls it is impossible to distinguish genuine feature utility from a simple magnitude bias, since later stages of SE-ResNeXt-50 typically have larger channel counts and activation scales.
Authors: We agree that the abstract and main text would benefit from explicit reporting of the learned aggregation weights and their variability. In the revised manuscript we will add a table or figure showing the mean weights per stage (across subjects and cross-validation folds) together with standard errors. We will also include a control experiment in which per-stage feature maps are L2-normalized before aggregation; if the later-stage preference persists under this normalization, it will strengthen the claim that the result reflects feature utility rather than magnitude differences. The current evidence for the preference rests on its consistency across independent subjects and both fMRI and MEG modalities, but the proposed ablation is a valuable addition. revision: yes
-
Referee: [Abstract] The manuscript states that weights are 'learned against the challenge data' but provides no description of the aggregation operator (linear, gated, or otherwise), the loss, regularization on the weights, or any test that the small number of similarity pairs does not produce degenerate or unstable solutions. This directly affects whether the reported later-stage preference can be interpreted as reflecting brain-feature alignment rather than fitting artifacts.
Authors: The abstract is intentionally concise; the full methods section describes the aggregation as a learnable linear combination of stage-wise pooled features (with per-stage projection to a common dimensionality) optimized directly against the challenge similarity metric. We will expand the methods description and add a short paragraph in the main text that explicitly states the operator, the loss, the L2 regularization applied to the aggregation weights, and the cross-validation protocol used to assess solution stability. The consistency of the later-stage preference across multiple random seeds and subject-wise folds already provides evidence against degeneracy, but we will report these diagnostics more prominently. revision: yes
Circularity Check
No circularity: empirical observation from learned weights, not a derivation reducing to inputs
full rationale
The paper trains aggregation weights on the Algonauts challenge data using a fixed pretrained SE-ResNeXt-50 backbone and reports the resulting stage preferences as an empirical finding. No derivation chain is claimed that reduces a 'prediction' or 'first-principles result' to the fitted inputs by construction, nor are there self-citations, uniqueness theorems, or ansatzes that close the loop. The central result is an observation about the optimized weights rather than a forced equivalence.
Axiom & Free-Parameter Ledger
free parameters (1)
- stage aggregation weights
axioms (1)
- domain assumption Intermediate layers of SE-ResNeXt-50 capture a visual hierarchy comparable to the brain
Reference graph
Works this paper leans on
-
[1]
U. Guclu, Marcel A. J. van Gerven, Umut Güçlü, and Marcel A. J. van Gerven. Deep Neural Networks Reveal a Gradient in the Complexity of Neural Representations across the Ventral Stream. Journal of Neuroscience, 35(27):10005–10014, 7 2015
work page 2015
-
[2]
Tomoyasu Horikawa and Yukiyasu Kamitani. Hierarchical Neural Representation of Dreamed Objects Revealed by Brain Decoding with Deep Neural Network Features. Frontiers in Computational Neuroscience, 11, 2017
work page 2017
-
[3]
Haiguang Wen, Junxing Shi, Wei Chen, and Zhongming Liu. Deep Residual Network Predicts Cortical Rep- resentation and Organization of Visual Features for Rapid Categorization. Scientific Reports, 8(1):3752, 12 2018
work page 2018
-
[4]
Shany Grossman, Guy Gaziv, Erin M Yeagle, Michal Harel, Pierre Megevand, David M Groppe, Simon Khuvis, Jose L Herrero, Michal Irani, Ashesh D Mehta, and Rafael Malach. Deep Convolutional modeling of human face selective columns reveals their role in pictorial face representation. bioRxiv, page 444323, 10 2018
work page 2018
-
[5]
End-to-end deep image reconstruction from human brain activity
Guohua Shen, Kshitij Dwivedi, Kei Majima, Tomoyasu Horikawa, and Yukiyasu Kamitani. End-to-end deep image reconstruction from human brain activity. bioRxiv, page 272518, 2018
work page 2018
-
[6]
Seeing it all: Convolutional network layers map the function of the human visual system
Michael Eickenberg, Alexandre Gramfort, Gaël Varoquaux, and Bertrand Thirion. Seeing it all: Convolutional network layers map the function of the human visual system. NeuroImage, 152:184–194, 2017
work page 2017
-
[7]
Optimizing deep video representation to match brain activity
Hugo Richard, Ana Pinho, Bertrand Thirion, and Guillaume Charpiat. Optimizing deep video representation to match brain activity. Technical report, 2018
work page 2018
-
[8]
Radoslaw Martin Cichy and Gemma Roig. The Algonauts Project: A Platform for Communication between the Sciences of Biological and Artificial Intelligence. Technical report
-
[9]
Representational similarity analysis – connecting the branches of systems neuroscience
Nikolaus Kriegeskorte, Marieke Mur, and Peter A Bandettini. Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4, 11 2008
work page 2008
-
[10]
Radoslaw Martin Cichy, Dimitrios Pantazis, and Aude Oliva. Similarity-Based Fusion of MEG and fMRI Reveals Spatio-Temporal Dynamics in Human Cortex During Visual Object Recognition. 2016
work page 2016
-
[11]
Hypercolumns for object segmentation and fine-grained localization
Bharath Hariharan, Pablo Arbelaez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 447–456. IEEE, 6 2015
work page 2015
-
[12]
Squeeze-and-Excitation Networks
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-Excitation Networks. Technical report
-
[13]
Benchmark Analysis of Representative Deep Neural Network Architectures
Simone Bianco, Remi Cadene, Luigi Celona, and Paolo Napoletano. Benchmark Analysis of Representative Deep Neural Network Architectures. 10 2018. 5
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.