Learning to aggregate feature representations

Guy Gaziv

arxiv: 1907.01034 · v3 · pith:OVWPVFRNnew · submitted 2019-07-01 · 💻 cs.CV · eess.IV· q-bio.NC

Learning to aggregate feature representations

Guy Gaziv This is my paper

Pith reviewed 2026-05-25 11:45 UTC · model grok-4.3

classification 💻 cs.CV eess.IVq-bio.NC

keywords feature aggregationbrain encodingdeep neural networksvisual cortexfMRIMEGimage classification featuresmulti-subject modeling

0 comments

The pith

Learning aggregation weights over the stages of a pretrained image network produces a flexible encoder for brain activity from images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs an encoder that maps images to measured brain responses by learning how to combine the outputs of five successive stages in a fixed, ImageNet-trained classification network. This single learned aggregation step is shown to handle data from many subjects at once, work with sparse similarity measurements, cover multiple brain regions, and apply to both fMRI and MEG recordings. The resulting model ranks among the top entries when evaluated on the held-out test sets for both recording modalities. The learned weights turn out to assign higher importance to deeper stages even when the target brain area is an early visual region.

Core claim

Training a set of aggregation weights on the five-stage outputs of an SE-ResNeXt-50 network allows the same backbone to serve as an effective image-to-brain encoder for multiple subjects, regions of interest, and recording modalities; the optimized weights consistently favor the later stages for both early visual cortex and higher-order inferotemporal cortex.

What carries the argument

The learned aggregation weights that modulate and screen the feature maps produced at each of the five successive stages of the fixed classification network.

If this is right

The identical aggregation procedure works for both fMRI and MEG measurements without modality-specific changes.
The same weights support prediction across multiple subjects and across both early and late visual regions of interest.
Later network stages receive higher aggregation weight than earlier ones even when the target region is early visual cortex.
The approach requires no retraining of the backbone network itself.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If later stages dominate, brain encoding models may benefit more from features that pool over large spatial extents than from the locally tuned early filters.
Replacing the backbone with a different pretrained network would test whether the preference for later stages is architecture-specific or general.
The same aggregation logic could be applied to other hierarchical sensory models if the goal is to predict responses from limited paired data.

Load-bearing premise

The features already present in the intermediate layers of an ImageNet-pretrained classification network are sufficient to explain brain responses once the right combination weights are found.

What would settle it

A fixed uniform average of the same five stages, or an aggregation that uses only the earliest stages, produces equal or higher prediction accuracy on the same multi-subject, multi-modality test data.

Figures

Figures reproduced from arXiv: 1907.01034 by Guy Gaziv.

**Figure 1.** Figure 1: Illustration of our ‘Learning to aggregate’ proposed method. Features arising from stage s and feature map (channel) j are modulated (scaled) by their corresponding learned coefficient β s j . All modulated features are concatenated to form the image embedding. The embeddings of every pair of images are Pearson-correlated to yield the corresponding ‘universal’ (i.e., good for all 15 subjects) RDM value. of… view at source ↗

**Figure 2.** Figure 2: shows the resulting tuning to network stages post training for EVC and IT. Surprisingly, no particular preference towards early stages of the network was recorded for the EVC-based model. We note that this result was based solely on fMRI data (either EVC or IT) and did not consider the MEG data. EVC IT Mask weights Network stage (layer) [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

read the original abstract

The Algonauts challenge requires to construct a multi-subject encoder of images to brain activity. Deep networks such as ResNet-50 and AlexNet trained for image classification are known to produce feature representations along their intermediate stages which closely mimic the visual hierarchy. However the challenges introduced in the Algonauts project, including combining data from multiple subjects, relying on very few similarity data points, solving for various ROIs, and multi-modality, require devising a flexible framework which can efficiently accommodate them. Here we build upon a recent state-of-the-art classification network (SE-ResNeXt-50) and construct an adaptive combination of its intermediate representations. While the pretrained network serves as a backbone of our model, we learn how to aggregate feature representations along five stages of the network. During learning, our method enables to modulate and screen outputs from each stage along the network as governed by the optimized objective. We applied our method to the Algonauts2019 fMRI and MEG challenges. Using the combined fMRI and MEG data, our approach was rated among the leading five for both challenges. Surprisingly we find that for both the lower and higher order areas (EVC and IT) the adaptive aggregation favors features stemming at later stages of the network.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Learned aggregation weights on a fixed SE-ResNeXt-50 place higher emphasis on later stages for both EVC and IT in Algonauts, but the result sits on untested assumptions about scale and optimization.

read the letter

The main thing to know is that this paper learns weights to combine outputs from five stages of an ImageNet-pretrained SE-ResNeXt-50 and uses the result to encode images into brain activity for the Algonauts fMRI and MEG tracks. It placed in the top five for both, and the weights ended up favoring later stages even for early visual cortex. They also trained on combined fMRI and MEG data to handle the multi-subject, multi-ROI, multi-modality setting with limited pairs. That practical recipe is the core of the work. The approach keeps the backbone frozen and only tunes the aggregation, which is a reasonable way to stay parameter-efficient while adapting to the brain data constraints. Using a recent classification network as the feature source is a sensible starting point, and the competitive ranking shows the method can deliver usable performance without heavy customization. The soft spots are around the headline observation. The abstract supplies no description of the aggregation operator, the exact loss, validation procedure, or any ablation that would rule out simpler explanations for the later-layer bias. If later stages have larger norms or more channels, a basic weighted sum can produce the same pattern even when earlier features are equally or more predictive. The paper gives no check for that, so the claim that the aggregation genuinely favors later stages for EVC rests on an assumption that is not tested in the given text. This is mainly for groups already running brain encoding models on the Algonauts benchmark or similar limited-data settings. It does not introduce a new framework or change the broader picture of DNN-brain alignment. I would send it to peer review if the full methods include the aggregation details plus at least one control for feature scale or a comparison to fixed-weight baselines. The challenge ranking is concrete enough to be worth documenting once those gaps are closed.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes learning adaptive aggregation weights over the five intermediate stages of a fixed, ImageNet-pretrained SE-ResNeXt-50 backbone to predict multi-subject fMRI and MEG responses in the Algonauts 2019 challenge. The method modulates and screens stage outputs during optimization against the challenge similarity data and reports a top-5 ranking for both modalities. The central empirical claim is that the learned weights assign higher importance to later stages for both EVC and IT.

Significance. If the stage-preference result survives controls for feature magnitude and optimization bias, the finding would be noteworthy because it suggests that later-stage features from a classification network remain useful even for early visual areas, contrary to the usual expectation that EVC should be best captured by early layers. The approach supplies a compact, learnable aggregation mechanism that accommodates multi-subject and multi-modality data with few similarity pairs. No machine-checked proofs or parameter-free derivations are present; the contribution is empirical and challenge-oriented.

major comments (2)

[Abstract] Abstract: the claim that 'the adaptive aggregation favors features stemming at later stages of the network' for both EVC and IT is load-bearing for the headline result, yet the text supplies neither the learned weight vectors, their standard errors across subjects or folds, nor any ablation that normalizes per-stage activation norms before aggregation. Without such controls it is impossible to distinguish genuine feature utility from a simple magnitude bias, since later stages of SE-ResNeXt-50 typically have larger channel counts and activation scales.
[Abstract] The manuscript states that weights are 'learned against the challenge data' but provides no description of the aggregation operator (linear, gated, or otherwise), the loss, regularization on the weights, or any test that the small number of similarity pairs does not produce degenerate or unstable solutions. This directly affects whether the reported later-stage preference can be interpreted as reflecting brain-feature alignment rather than fitting artifacts.

minor comments (1)

[Abstract] The abstract refers to 'five stages of the network' without defining the exact layer boundaries or residual-block groupings used for SE-ResNeXt-50; a table or figure listing the stages would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'the adaptive aggregation favors features stemming at later stages of the network' for both EVC and IT is load-bearing for the headline result, yet the text supplies neither the learned weight vectors, their standard errors across subjects or folds, nor any ablation that normalizes per-stage activation norms before aggregation. Without such controls it is impossible to distinguish genuine feature utility from a simple magnitude bias, since later stages of SE-ResNeXt-50 typically have larger channel counts and activation scales.

Authors: We agree that the abstract and main text would benefit from explicit reporting of the learned aggregation weights and their variability. In the revised manuscript we will add a table or figure showing the mean weights per stage (across subjects and cross-validation folds) together with standard errors. We will also include a control experiment in which per-stage feature maps are L2-normalized before aggregation; if the later-stage preference persists under this normalization, it will strengthen the claim that the result reflects feature utility rather than magnitude differences. The current evidence for the preference rests on its consistency across independent subjects and both fMRI and MEG modalities, but the proposed ablation is a valuable addition. revision: yes
Referee: [Abstract] The manuscript states that weights are 'learned against the challenge data' but provides no description of the aggregation operator (linear, gated, or otherwise), the loss, regularization on the weights, or any test that the small number of similarity pairs does not produce degenerate or unstable solutions. This directly affects whether the reported later-stage preference can be interpreted as reflecting brain-feature alignment rather than fitting artifacts.

Authors: The abstract is intentionally concise; the full methods section describes the aggregation as a learnable linear combination of stage-wise pooled features (with per-stage projection to a common dimensionality) optimized directly against the challenge similarity metric. We will expand the methods description and add a short paragraph in the main text that explicitly states the operator, the loss, the L2 regularization applied to the aggregation weights, and the cross-validation protocol used to assess solution stability. The consistency of the later-stage preference across multiple random seeds and subject-wise folds already provides evidence against degeneracy, but we will report these diagnostics more prominently. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observation from learned weights, not a derivation reducing to inputs

full rationale

The paper trains aggregation weights on the Algonauts challenge data using a fixed pretrained SE-ResNeXt-50 backbone and reports the resulting stage preferences as an empirical finding. No derivation chain is claimed that reduces a 'prediction' or 'first-principles result' to the fitted inputs by construction, nor are there self-citations, uniqueness theorems, or ansatzes that close the loop. The central result is an observation about the optimized weights rather than a forced equivalence.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that ImageNet features transfer to brain encoding and on standard supervised learning assumptions; no new entities are introduced and the only free parameters are the learned aggregation weights.

free parameters (1)

stage aggregation weights
Learned parameters that decide contribution and modulation of each of the five network stages during training on the challenge data.

axioms (1)

domain assumption Intermediate layers of SE-ResNeXt-50 capture a visual hierarchy comparable to the brain
Invoked to justify using the pretrained network as backbone without further pretraining or architectural changes.

pith-pipeline@v0.9.0 · 5742 in / 1212 out tokens · 28464 ms · 2026-05-25T11:45:13.589977+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

Guclu, Marcel A

U. Guclu, Marcel A. J. van Gerven, Umut Güçlü, and Marcel A. J. van Gerven. Deep Neural Networks Reveal a Gradient in the Complexity of Neural Representations across the Ventral Stream. Journal of Neuroscience, 35(27):10005–10014, 7 2015

work page 2015
[2]

Hierarchical Neural Representation of Dreamed Objects Revealed by Brain Decoding with Deep Neural Network Features

Tomoyasu Horikawa and Yukiyasu Kamitani. Hierarchical Neural Representation of Dreamed Objects Revealed by Brain Decoding with Deep Neural Network Features. Frontiers in Computational Neuroscience, 11, 2017

work page 2017
[3]

Deep Residual Network Predicts Cortical Rep- resentation and Organization of Visual Features for Rapid Categorization

Haiguang Wen, Junxing Shi, Wei Chen, and Zhongming Liu. Deep Residual Network Predicts Cortical Rep- resentation and Organization of Visual Features for Rapid Categorization. Scientiﬁc Reports, 8(1):3752, 12 2018

work page 2018
[4]

Deep Convolutional modeling of human face selective columns reveals their role in pictorial face representation

Shany Grossman, Guy Gaziv, Erin M Yeagle, Michal Harel, Pierre Megevand, David M Groppe, Simon Khuvis, Jose L Herrero, Michal Irani, Ashesh D Mehta, and Rafael Malach. Deep Convolutional modeling of human face selective columns reveals their role in pictorial face representation. bioRxiv, page 444323, 10 2018

work page 2018
[5]

End-to-end deep image reconstruction from human brain activity

Guohua Shen, Kshitij Dwivedi, Kei Majima, Tomoyasu Horikawa, and Yukiyasu Kamitani. End-to-end deep image reconstruction from human brain activity. bioRxiv, page 272518, 2018

work page 2018
[6]

Seeing it all: Convolutional network layers map the function of the human visual system

Michael Eickenberg, Alexandre Gramfort, Gaël Varoquaux, and Bertrand Thirion. Seeing it all: Convolutional network layers map the function of the human visual system. NeuroImage, 152:184–194, 2017

work page 2017
[7]

Optimizing deep video representation to match brain activity

Hugo Richard, Ana Pinho, Bertrand Thirion, and Guillaume Charpiat. Optimizing deep video representation to match brain activity. Technical report, 2018

work page 2018
[8]

The Algonauts Project: A Platform for Communication between the Sciences of Biological and Artiﬁcial Intelligence

Radoslaw Martin Cichy and Gemma Roig. The Algonauts Project: A Platform for Communication between the Sciences of Biological and Artiﬁcial Intelligence. Technical report

work page
[9]

Representational similarity analysis – connecting the branches of systems neuroscience

Nikolaus Kriegeskorte, Marieke Mur, and Peter A Bandettini. Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4, 11 2008

work page 2008
[10]

Similarity-Based Fusion of MEG and fMRI Reveals Spatio-Temporal Dynamics in Human Cortex During Visual Object Recognition

Radoslaw Martin Cichy, Dimitrios Pantazis, and Aude Oliva. Similarity-Based Fusion of MEG and fMRI Reveals Spatio-Temporal Dynamics in Human Cortex During Visual Object Recognition. 2016

work page 2016
[11]

Hypercolumns for object segmentation and ﬁne-grained localization

Bharath Hariharan, Pablo Arbelaez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and ﬁne-grained localization. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 447–456. IEEE, 6 2015

work page 2015
[12]

Squeeze-and-Excitation Networks

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-Excitation Networks. Technical report

work page
[13]

Benchmark Analysis of Representative Deep Neural Network Architectures

Simone Bianco, Remi Cadene, Luigi Celona, and Paolo Napoletano. Benchmark Analysis of Representative Deep Neural Network Architectures. 10 2018. 5

work page 2018

[1] [1]

Guclu, Marcel A

U. Guclu, Marcel A. J. van Gerven, Umut Güçlü, and Marcel A. J. van Gerven. Deep Neural Networks Reveal a Gradient in the Complexity of Neural Representations across the Ventral Stream. Journal of Neuroscience, 35(27):10005–10014, 7 2015

work page 2015

[2] [2]

Hierarchical Neural Representation of Dreamed Objects Revealed by Brain Decoding with Deep Neural Network Features

Tomoyasu Horikawa and Yukiyasu Kamitani. Hierarchical Neural Representation of Dreamed Objects Revealed by Brain Decoding with Deep Neural Network Features. Frontiers in Computational Neuroscience, 11, 2017

work page 2017

[3] [3]

Deep Residual Network Predicts Cortical Rep- resentation and Organization of Visual Features for Rapid Categorization

Haiguang Wen, Junxing Shi, Wei Chen, and Zhongming Liu. Deep Residual Network Predicts Cortical Rep- resentation and Organization of Visual Features for Rapid Categorization. Scientiﬁc Reports, 8(1):3752, 12 2018

work page 2018

[4] [4]

Deep Convolutional modeling of human face selective columns reveals their role in pictorial face representation

Shany Grossman, Guy Gaziv, Erin M Yeagle, Michal Harel, Pierre Megevand, David M Groppe, Simon Khuvis, Jose L Herrero, Michal Irani, Ashesh D Mehta, and Rafael Malach. Deep Convolutional modeling of human face selective columns reveals their role in pictorial face representation. bioRxiv, page 444323, 10 2018

work page 2018

[5] [5]

End-to-end deep image reconstruction from human brain activity

Guohua Shen, Kshitij Dwivedi, Kei Majima, Tomoyasu Horikawa, and Yukiyasu Kamitani. End-to-end deep image reconstruction from human brain activity. bioRxiv, page 272518, 2018

work page 2018

[6] [6]

Seeing it all: Convolutional network layers map the function of the human visual system

Michael Eickenberg, Alexandre Gramfort, Gaël Varoquaux, and Bertrand Thirion. Seeing it all: Convolutional network layers map the function of the human visual system. NeuroImage, 152:184–194, 2017

work page 2017

[7] [7]

Optimizing deep video representation to match brain activity

Hugo Richard, Ana Pinho, Bertrand Thirion, and Guillaume Charpiat. Optimizing deep video representation to match brain activity. Technical report, 2018

work page 2018

[8] [8]

The Algonauts Project: A Platform for Communication between the Sciences of Biological and Artiﬁcial Intelligence

Radoslaw Martin Cichy and Gemma Roig. The Algonauts Project: A Platform for Communication between the Sciences of Biological and Artiﬁcial Intelligence. Technical report

work page

[9] [9]

Representational similarity analysis – connecting the branches of systems neuroscience

Nikolaus Kriegeskorte, Marieke Mur, and Peter A Bandettini. Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4, 11 2008

work page 2008

[10] [10]

Similarity-Based Fusion of MEG and fMRI Reveals Spatio-Temporal Dynamics in Human Cortex During Visual Object Recognition

Radoslaw Martin Cichy, Dimitrios Pantazis, and Aude Oliva. Similarity-Based Fusion of MEG and fMRI Reveals Spatio-Temporal Dynamics in Human Cortex During Visual Object Recognition. 2016

work page 2016

[11] [11]

Hypercolumns for object segmentation and ﬁne-grained localization

Bharath Hariharan, Pablo Arbelaez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and ﬁne-grained localization. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 447–456. IEEE, 6 2015

work page 2015

[12] [12]

Squeeze-and-Excitation Networks

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-Excitation Networks. Technical report

work page

[13] [13]

Benchmark Analysis of Representative Deep Neural Network Architectures

Simone Bianco, Remi Cadene, Luigi Celona, and Paolo Napoletano. Benchmark Analysis of Representative Deep Neural Network Architectures. 10 2018. 5

work page 2018