AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space

Huzheng Yang; James Gee; Jianbo Shi

arxiv: 2406.18344 · v2 · submitted 2024-06-26 · 💻 cs.CV

Pith reviewed 2026-05-23 23:51 UTC · model grok-4.3

keywords visual concepts · fMRI alignment · channel clustering · universal feature space · object segmentation · brain-guided features · deep network comparison

The pith

Aligning deep network channels to predict brain fMRI responses reveals shared visual concepts as recurring clusters that segment objects without supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a universal feature space by training alignments between channels in different deep networks so that they jointly predict fMRI voxel responses to images. Once aligned, channels that recur across models form clusters that map onto distinct brain regions. When these cluster responses are traced back onto the original images, they delineate semantically coherent object parts. The same alignment also yields layer-wise maps of how visual information is transformed inside each network, enabling direct comparisons. A reader would care because the method grounds the discovery of visual concepts in brain data rather than human-provided labels or task-specific losses.
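The alignment idea can be illustrated with a toy sketch. This is not the authors' implementation: it assumes synthetic data, two hypothetical "models" whose channels are different random mixtures of the same latent image content, and closed-form ridge regression standing in for whatever optimizer the paper uses. It also uses the predicted voxel responses themselves as the shared space, which is a simplification. The names `ridge_fit` and `mean_abs_cosine` are illustrative only.

```python
import numpy as np

def ridge_fit(X, Y, lam=1e-2):
    """Closed-form ridge regression: W minimizing ||XW - Y||^2 + lam * ||W||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def mean_abs_cosine(A, B):
    """Mean absolute cosine similarity between matching (centered) columns."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    num = np.sum(A * B, axis=0)
    den = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0) + 1e-12
    return float(np.mean(np.abs(num / den)))

rng = np.random.default_rng(0)
n_images, n_latent, n_channels, n_voxels = 500, 8, 32, 20

# Synthetic stand-ins (assumption): shared latent content, two models, one brain.
Z = rng.normal(size=(n_images, n_latent))
feat_a = Z @ rng.normal(size=(n_latent, n_channels))   # channels of model A
feat_b = Z @ rng.normal(size=(n_latent, n_channels))   # channels of model B
voxels = Z @ rng.normal(size=(n_latent, n_voxels))
voxels = voxels + 0.1 * rng.normal(size=voxels.shape)  # measurement noise

# Each model gets its own linear map, trained only to predict voxel responses.
aligned_a = feat_a @ ridge_fit(feat_a, voxels)
aligned_b = feat_b @ ridge_fit(feat_b, voxels)

raw_sim = mean_abs_cosine(feat_a, feat_b)            # channel i of A vs channel i of B
aligned_sim = mean_abs_cosine(aligned_a, aligned_b)  # same comparison after alignment
```

In this toy setting the raw channels of the two models are only weakly related index by index, while the aligned outputs nearly coincide, because both maps are fit to the same voxel targets.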

Core claim

Training channel alignment solely to predict fMRI voxel responses produces a shared feature space in which channels from differently trained networks form recurring clusters; these clusters correspond to distinct brain regions and, when projected onto images, delineate semantically meaningful object segments even without any supervised segmentation decoder. The same construction quantifies how visual information is processed through successive layers of each network.
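How cluster structure in a feature space turns into image segments can be sketched with plain spectral clustering. The sketch below is an assumption-laden toy: it uses a cosine affinity, a full eigendecomposition (the paper's reference list mentions a Nystrom approximation, which is not implemented here), and a small hand-rolled k-means. The 8x8 "image" and its two feature prototypes are synthetic.

```python
import numpy as np

def spectral_embed(features, n_eig=2):
    """Top eigenvectors of the symmetrically normalized cosine affinity."""
    F = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    A = np.clip(F @ F.T, 0.0, None)        # non-negative cosine affinity
    d = A.sum(axis=1)                      # degrees (>= 1, the diagonal is 1)
    A_norm = A / np.sqrt(np.outer(d, d))   # D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(A_norm)       # eigenvalues in ascending order
    return vecs[:, -n_eig:]                # keep the largest n_eig

def kmeans(X, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(sq, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Synthetic per-pixel features: left half of the image is one "concept",
# right half another (both prototypes are made up for illustration).
rng = np.random.default_rng(0)
h, w, dim = 8, 8, 16
proto = np.zeros((2, dim))
proto[0, 0] = 1.0
proto[1, 1] = 1.0
truth = np.zeros((h, w), dtype=int)
truth[:, w // 2:] = 1
features = proto[truth.ravel()] + 0.05 * rng.normal(size=(h * w, dim))

embedding = spectral_embed(features, n_eig=2)
segments = kmeans(embedding, k=2).reshape(h, w)
```

No labels enter the clustering; `truth` is used only to generate the toy features and to check the result afterwards.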

What carries the argument

Universal channel alignment trained to predict fMRI voxel responses, which groups channels into brain-region-corresponding clusters.

If this is right

Channels shared across models indicate that visual concepts form independently of the original training objective.
Cluster-to-brain-region correspondence supplies a quantitative map of visual processing stages.
Object segments emerge directly from the aligned channel responses without any pixel-level supervision.
Layer-wise processing comparisons become possible between arbitrary networks inside the same aligned space.
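One simple way to quantify layer-to-layer processing, in the spirit of the transition probabilities shown in Figure 10, is a row-normalized count matrix over per-pixel cluster labels. The sketch below assumes each pixel has already been assigned a concept label at two consecutive layers; the label arrays are made up, and `transition_matrix` is a hypothetical helper, not a function from the paper.

```python
import numpy as np

def transition_matrix(labels_from, labels_to, n_from, n_to):
    """P[i, j] = fraction of pixels in cluster i (earlier layer) that land in cluster j."""
    counts = np.zeros((n_from, n_to))
    np.add.at(counts, (labels_from, labels_to), 1)   # accumulate co-occurrence counts
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no pixels stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Illustrative labels for six pixels at two consecutive layers.
layer_a = np.array([0, 0, 0, 1, 1, 2])
layer_b = np.array([0, 0, 1, 1, 1, 0])
P = transition_matrix(layer_a, layer_b, n_from=3, n_to=2)
# Row 0: two of three pixels stay in cluster 0, one moves to cluster 1.
```

Because each non-empty row sums to one, matrices computed for different networks in the same aligned space can be compared directly.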

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method might be extended to test whether the discovered clusters remain stable when the fMRI training data are replaced by responses from a different subject population.
If the segments traced from the brain-region clusters prove reliable, the approach could be used to generate pseudo-labels for unsupervised segmentation benchmarks.
The alignment could be applied to non-visual modalities to check whether analogous cross-model concept clusters appear outside vision.

Load-bearing premise

That the clusters obtained from fMRI alignment reflect genuine visual concepts rather than artifacts of the alignment process or correlations specific to the training images and brain data.

What would settle it

Finding that the same clusters fail to match known functional brain regions or produce inconsistent object segments when tested on a new image dataset or a different set of network architectures.

Figures

Figures reproduced from arXiv: 2406.18344 by Huzheng Yang, James Gee, Jianbo Shi.

**Figure 1.** Transform the hidden channel activation of deep-nets into visual brain voxels’ response. If a picture is worth a thousand words, the main idea is that the brain’s thousands of voxels can be thought of as alphabets for these words that describe an image. Just as alphabets must be combined to form words and phrases with meanings, we need to find the grouping of brain voxels and their network channel counterp…

**Figure 2.** From the 768D feature on CLIP layer-6, we extract different levels of segmentation by …

**Figure 3.** Cosine similarity of channel activation on the same image inputs. Just as human languages might consist of distinct alphabets, features across different models appear superficially in embedding spaces as almost mutually orthogonal (…

**Figure 4.** Spectral clustering in the universal channel aligned feature space. The image pixels are …

**Figure 5.** Unsupervised segmentation scores from spectral clustering on each CLIP layer. ImageNet …

**Figure 6.** The figure-ground visual concepts in CLIP layer-5.

**Figure 7.** The same figure-ground visual concepts are found in CLIP, DINO and MAE.

**Figure 8.** Category visual concepts in CLIP layer-9.

**Figure 9.** Trajectory of feature progression in layers for six example pixels.

**Figure 10.** Transition probability of visual concepts from CLIP layer-3 to layer-4.

**Figure 11.** Brain Region of Interests (ROIs). V1v: ventral stream, V1d: dorsal stream

**Figure 12.** Spectral clustering in the universal channel aligned feature space. The image pixels are …

**Figure 13.** Spectral clustering in the universal channel aligned feature space. The image pixels are …

**Figure 14.** Spectral clustering in the universal channel aligned feature space. The image pixels are …

**Figure 15.** Spectral clustering in the universal channel aligned feature space. The image pixels are …

**Figure 16.** Mean activation of foreground or background pixels at each layer of CLIP, DINO and …

**Figure 17.** Category visual concepts in CLIP Layer 9.

**Figure 18.** Trajectory of feature progression from layer to layer, in the 2D spectral-tSNE space.

**Figure 19.** Trajectory of feature progression from layer to layer, in the 2D spectral-tSNE space.

**Figure 20.** Trajectory of feature progression from layer to layer, in the 2D spectral-tSNE space.

**Figure 21.** Trajectory of feature progression from layer to layer, in the 2D spectral-tSNE space.

**Figure 22.** Trajectory of feature progression from layer to layer, in the 2D spectral-tSNE space.

Original abstract

We study the intriguing connection between visual data, deep networks, and the brain. Our method creates a universal channel alignment by using brain voxel fMRI response prediction as the training objective. We discover that deep networks, trained with different objectives, share common feature channels across various models. These channels can be clustered into recurring sets, corresponding to distinct brain regions, indicating the formation of visual concepts. Tracing the clusters of channel responses onto the images, we see semantically meaningful object segments emerge, even without any supervised decoder. Furthermore, the universal feature alignment and the clustering of channels produce a picture and quantification of how visual information is processed through the different network layers, which produces precise comparisons between the networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

fMRI voxel prediction aligns channels across vision models and yields clusters tied to brain regions plus unsupervised segments, but the abstract supplies no numbers or controls to check if this is more than alignment artifacts.

The main point is that the authors align feature channels from different networks by training solely to predict fMRI voxel responses, then cluster the aligned channels. The clusters line up with distinct brain areas and, when traced back to images, produce object-like segments without any label supervision. They also use the setup to compare how layers in different models process information.

Referee Report

2 major / 1 minor

Summary. The paper introduces AlignedCut, a method that creates a universal channel alignment across deep networks by training to predict fMRI voxel responses as the objective. It claims that networks trained with different objectives share common feature channels, which can be clustered into recurring sets corresponding to distinct brain regions (indicating visual concepts). Tracing cluster responses onto images yields semantically meaningful object segments without any supervised decoder. The approach also quantifies visual information processing across layers to enable precise network comparisons.

Significance. If the central claims hold after proper controls, the work would offer a brain-guided lens for discovering and interpreting shared visual concepts in deep networks, along with a quantitative framework for layer-wise model comparison. The fMRI-prediction alignment is a distinctive choice that could bridge CV and neuroscience, but the abstract, on which this report is based, supplies no quantitative results, ablations, or statistical validation, which limits the current assessment of impact.

major comments (2)

[Method] Method (alignment and clustering procedure): The sole training objective is fMRI voxel response prediction, after which channels are clustered by similarity in their aligned prediction profiles. This makes the reported correspondence between clusters and distinct brain regions a direct consequence of the supervision signal rather than an independent discovery of visual concepts. No ablation against non-brain or randomized targets is described, which is load-bearing for the claim that clusters reflect genuine emergent concepts rather than alignment artifacts.
[Results] Results/Experiments: No quantitative metrics (e.g., cluster-brain region correspondence scores, segmentation IoU, statistical significance tests), ablation studies, or error analysis are reported to support that the traced segments are semantically meaningful or that the layer-wise quantification is robust. Visual inspection alone cannot substantiate the central claims about concept formation and network comparisons.

minor comments (1)

[Abstract] Abstract: The description of the clustering step and its mapping to brain regions could be clarified to avoid implying independence from the fMRI supervision.
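One possible form of the randomized-target control requested above can be sketched as follows. This is a hypothetical illustration on synthetic data, not an experiment from the paper: it fits the same ridge regression once against real voxel targets and once against targets shuffled across images, then compares held-out prediction correlation. It checks only the alignment fit, so a full control would also need to compare the resulting clusters. All names and sizes are assumptions.

```python
import numpy as np

def ridge_fit(X, Y, lam=1.0):
    """Closed-form ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def heldout_corr(X, Y, n_train, lam=1.0):
    """Mean Pearson correlation between predicted and true held-out targets."""
    W = ridge_fit(X[:n_train], Y[:n_train], lam)
    pred, true = X[n_train:] @ W, Y[n_train:]
    pc, tc = pred - pred.mean(axis=0), true - true.mean(axis=0)
    r = (pc * tc).sum(axis=0) / (
        np.linalg.norm(pc, axis=0) * np.linalg.norm(tc, axis=0) + 1e-12
    )
    return float(r.mean())

rng = np.random.default_rng(0)
n_images, n_channels, n_voxels = 400, 16, 10

X = rng.normal(size=(n_images, n_channels))             # stand-in network features
Y = X @ rng.normal(size=(n_channels, n_voxels))         # stand-in voxel responses
Y = Y + 0.5 * rng.normal(size=Y.shape)
Y_shuffled = Y[rng.permutation(n_images)]               # breaks the image-to-response pairing

real_score = heldout_corr(X, Y, n_train=300)
control_score = heldout_corr(X, Y_shuffled, n_train=300)
```

If structure attributed to brain guidance survived the shuffled-target condition, that would point toward an artifact of the procedure; a clear gap between `real_score` and `control_score` is the minimum one would expect to see.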

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our method and experimental validation. The comments correctly identify areas where additional controls and metrics would strengthen the manuscript. We address each point below and commit to revisions that incorporate the suggested analyses.

Referee: [Method] Method (alignment and clustering procedure): The sole training objective is fMRI voxel response prediction, after which channels are clustered by similarity in their aligned prediction profiles. This makes the reported correspondence between clusters and distinct brain regions a direct consequence of the supervision signal rather than an independent discovery of visual concepts. No ablation against non-brain or randomized targets is described, which is load-bearing for the claim that clusters reflect genuine emergent concepts rather than alignment artifacts.

Authors: The fMRI-based objective is deliberately chosen to ground the alignment in brain responses, enabling the discovery of how shared channels across networks map to brain regions. The recurring clusters across differently trained networks provide evidence of emergent shared concepts. We agree that ablations using randomized or non-brain targets are necessary to rule out alignment artifacts and will add these experiments to the revised manuscript. revision: yes

Referee: [Results] Results/Experiments: No quantitative metrics (e.g., cluster-brain region correspondence scores, segmentation IoU, statistical significance tests), ablation studies, or error analysis are reported to support that the traced segments are semantically meaningful or that the layer-wise quantification is robust. Visual inspection alone cannot substantiate the central claims about concept formation and network comparisons.

Authors: The current results emphasize qualitative visualization of semantic segments and layer-wise processing. We concur that quantitative support is required for the claims. In the revision we will report cluster-brain region correspondence scores, segmentation IoU values against ground-truth object masks, statistical significance tests, and additional ablation studies. revision: yes
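For reference, the segmentation IoU mentioned in this exchange is usually computed per class and then averaged. The sketch below shows that standard calculation on a tiny made-up label map; `mean_iou` is an illustrative helper, and the choice to skip classes absent from both maps is an assumption rather than anything the paper specifies.

```python
import numpy as np

def mean_iou(pred, truth, n_classes):
    """Mean intersection-over-union over classes present in pred or truth."""
    ious = []
    for c in range(n_classes):
        p, t = pred == c, truth == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

truth = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1]])
pred = np.array([[0, 1, 1, 1],
                 [0, 0, 1, 1]])

score = mean_iou(pred, truth, n_classes=2)
# Class 0: 3 shared pixels / 4 in the union = 0.75
# Class 1: 4 shared pixels / 5 in the union = 0.80
# Mean: 0.775
```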

Circularity Check

1 step flagged

Clustering of aligned channels reduces to fMRI voxel prediction profiles by construction

specific steps

fitted input called prediction [Abstract]
"Our method creates a universal channel alignment by using brain voxel fMRI response prediction as the training objective. We discover that deep networks, trained with different objectives, share common feature channels across various models. These channels can be clustered into recurring sets, corresponding to distinct brain regions, indicating the formation of visual concepts."

The alignment objective forces channels to be grouped by their similarity to the same fMRI voxel targets; subsequent clustering therefore recovers brain-region correspondence by construction from the fitted prediction profiles, rendering the 'discovery' of visual concepts and brain-region mapping tautological with the training signal.

full rationale

The paper trains a universal channel alignment whose sole objective is fMRI voxel response prediction, then clusters the aligned channels and interprets the resulting groups as recurring visual concepts that map to distinct brain regions. Because alignment directly ties channels to shared voxel targets, any clustering necessarily groups channels according to similarity in their fMRI prediction profiles; the claimed correspondence to brain regions and emergence of semantic segments is therefore a direct statistical consequence of the supervision signal rather than an independent discovery. No ablation against non-brain targets is described that would break this reduction. This matches the fitted-input-called-prediction pattern at the core of the method.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only; the central claim rests on the unstated premise that fMRI responses constitute an appropriate and sufficient supervisory signal for recovering visual concepts.

axioms (1)

domain assumption: fMRI voxel responses from visual cortex provide a valid training objective for aligning feature channels across networks
The entire alignment procedure is defined by this choice of objective.

pith-pipeline@v0.9.0 · 5643 in / 1187 out tokens · 25633 ms · 2026-05-23T23:51:53.066657+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
cs.LG · 2026-04 · unverdicted · novelty 7.0

NeuroFlow is the first unified flow model for bidirectional visual encoding and decoding from neural activity using NeuroVAE and cross-modal flow matching.

Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding
cs.LG · 2026-04 · unverdicted · novelty 6.0

A meta-optimized in-context learning approach enables training-free cross-subject semantic visual decoding from fMRI by inferring individual neural encoding patterns via hierarchical inference on a few examples.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 2 Pith papers

[1] Fowlkes, C., Belongie, S., Chung, F., and Malik, J. (2004). Spectral grouping using the Nystrom method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214–225. Gandelsman, Y., Efros, A. A., and Steinhardt, J. (2024). Interpreting CLIP’s Image Representation via Text-Based Decomposition. In The Twelfth International Conferen...

[2] Appendix B summarizes background of brain ROIs

[3] Appendix C is implementation details: 2.1. Additional regularization terms; 2.2. Brain encoding model training loss function; 2.3. Unsupervised segmentation evaluation pipeline; 2.4. Nystrom-like approximation for t-SNE

[4] Appendix D lists more image examples from the 3D spectral-tSNE

[5] Appendix E lists figure-ground channel activation for every model and layer

[6] Appendix F lists more example category-specific visual concepts

[7] Appendix G lists more example pixels from the 2D spectral-tSNE information flow. B Brain Region Background Knowledge. Figure 11: Brain Region of Interests (ROIs). V1v: ventral stream, V1d: dorsal stream. Table 2: Known function and selectivity of brain region of interests (ROIs). ROI name V1 V2 V3 V4 EBA FBA OFA FFA OPA PPA OWFA VWFA Known Function/Sele...

[8] eigen-constraint regularization, 3) zero-centered regularization, and 4) covariance regularization: $L = L_{\text{brain}} + \lambda_{\text{eigen}} L_{\text{eigen}} + \lambda_{\text{zero}} L_{\text{zero}} + \lambda_{\text{cov}} L_{\text{cov}}$ (10), where we set $\lambda_{\text{eigen}} = 1$, $\lambda_{\text{zero}} = 0.01$, $\lambda_{\text{cov}} = 0.01$. C.4 Oracle-based Unsupervised Segmentation Evaluation Pipeline. Our unsupervised segmentation pipeline aims to benchmark and compare the performanc...

[9] Apply spectral clustering jointly across all images, taking the top 10 eigenvectors

[10] For each class of object (plus one background class), use ground-truth labels from the dataset to mask out the pixels and their eigenvectors, and then use the mean of the eigenvectors to define a center for each class

[11] Compute the cosine similarity of each pixel to all class centers

[12] For each pixel, if the maximum similarity to all classes is less than a threshold value, assign this pixel to the background class

[13] Assign pixels (with a similarity greater than the threshold value) to the class with the maximum similarity. There’s one hyper-parameter, the threshold value that requires different optimal value for each layer of CLIP. To ensure a fair comparison across all layers, the threshold value is grid-searched from 10 evenly spaced values between 0 and 1, the max...
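The oracle-based evaluation steps excerpted in [9] to [13] can be sketched roughly as below. Several parts are assumptions: the "eigenvectors" are synthetic stand-ins for the output of spectral clustering, the background class is given its own center like every other class, pixel accuracy replaces whatever score the paper reports, and keeping the best-scoring threshold is a guess because the source text is cut off at that point. Function names are hypothetical.

```python
import numpy as np

def unit_rows(M):
    """Scale each row to unit length."""
    return M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-12)

def oracle_segment(eigvecs, truth, n_classes, threshold, background=0):
    """Label pixels by cosine similarity to ground-truth-derived class centers.

    Every class in range(n_classes) must occur in `truth`.
    """
    # Class centers: mean eigenvector of the pixels carrying each label.
    centers = np.stack([eigvecs[truth == c].mean(axis=0) for c in range(n_classes)])
    # Cosine similarity of each pixel to each center.
    sim = unit_rows(eigvecs) @ unit_rows(centers).T
    # Assign the most similar class; low-similarity pixels go to background.
    pred = sim.argmax(axis=1)
    pred[sim.max(axis=1) < threshold] = background
    return pred

def grid_search_threshold(eigvecs, truth, n_classes):
    """Try 10 evenly spaced thresholds in [0, 1]; return the best one and its score."""
    thresholds = np.linspace(0.0, 1.0, 10)
    scores = [
        float(np.mean(oracle_segment(eigvecs, truth, n_classes, t) == truth))
        for t in thresholds
    ]
    best = int(np.argmax(scores))
    return float(thresholds[best]), scores[best]

# Synthetic stand-in for per-pixel top-10 eigenvectors with three classes.
rng = np.random.default_rng(0)
n_classes, n_eig = 3, 10
truth = np.repeat(np.arange(n_classes), 40)
prototypes = rng.normal(size=(n_classes, n_eig))
eigvecs = prototypes[truth] + 0.1 * rng.normal(size=(truth.size, n_eig))

best_threshold, best_score = grid_search_threshold(eigvecs, truth, n_classes)
```

Because the class centers are built from ground-truth labels, a score from this procedure is an upper-bound style probe of how separable the classes are in the eigenvector space, not a fully unsupervised result.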