Subject-Aware Multi-Granularity Alignment for Zero-Shot EEG-to-Image Retrieval

Duanpo Wu; Haiqi Xu; Jiale Xu; Lin Jiang; Qingshan She; Zhenzhong Kuang

arxiv: 2604.17782 · v1 · submitted 2026-04-20 · 💻 cs.CV

Pith reviewed 2026-05-10 05:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords: zero-shot EEG-to-image retrieval; subject-aware alignment; multi-granularity features; coarse-to-fine alignment; cross-modal retrieval; visual neural decoding; brain-computer interfaces; THINGS-EEG

The pith

SAMGA constructs subject-aware visual targets from multi-granularity features to align EEG signals for zero-shot image retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the SAMGA framework for zero-shot decoding of images from EEG. It builds a visual supervision target by adaptively combining features from multiple intermediate layers of a pretrained vision model, so that training accounts for how different subjects match best to different levels of visual detail. A shared encoder then performs alignment first at a coarse semantic level to stabilize the geometry across subjects, then at a finer level to sharpen instance discrimination. This design lets the trained model run without subject-specific information at inference. The paper reports gains over prior methods on the THINGS-EEG benchmark in both intra-subject and inter-subject evaluations.

Core claim

SAMGA first constructs a subject-aware visual supervision target by adaptively aggregating multiple intermediate representations from a pretrained vision encoder, allowing the model to absorb subject-dependent granularity deviations during training while preserving subject-agnostic inference. Building on this adaptive target construction, a coarse-to-fine cross-modal alignment strategy is further designed with a shared encoder wherein the coarse stage stabilizes the shared semantic geometry and reduces subject-induced distribution shift, and the fine stage further improves instance-level retrieval discrimination.

What carries the argument

Adaptive aggregation of multiple intermediate representations from a pretrained vision encoder to produce subject-aware visual supervision targets, paired with coarse-to-fine alignment inside a shared encoder.
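
The abstract does not say how the aggregation is parameterized. As a minimal sketch, assume each subject owns one learnable score per encoder layer and the target is a softmax-weighted sum of the layer features. The names `aggregate_target` and `subject_logits`, and the softmax form itself, are illustrative assumptions and not the paper's formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_target(layer_feats, subject_logits):
    """Convex combination of per-layer features for one image.

    layer_feats: list of K feature vectors (one per encoder layer, same length).
    subject_logits: K hypothetical learnable scores for one subject.
    """
    weights = softmax(subject_logits)
    dim = len(layer_feats[0])
    return [sum(w * feat[d] for w, feat in zip(weights, layer_feats))
            for d in range(dim)]

# Toy example: 3 layers (coarse -> fine), 4-dimensional features.
layer_feats = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
subject_a = [2.0, 0.0, 0.0]  # hypothetical subject leaning on coarse features
subject_b = [0.0, 0.0, 2.0]  # hypothetical subject leaning on fine features

target_a = aggregate_target(layer_feats, subject_a)
target_b = aggregate_target(layer_feats, subject_b)
uniform = aggregate_target(layer_feats, [0.0, 0.0, 0.0])  # fixed-average baseline
```

Under this assumption the subject-specific scores only shape the training target, which is consistent with the claim that the EEG encoder needs no subject information at test time. Setting all scores equal recovers a fixed average, which is the kind of baseline an ablation would compare against.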

Load-bearing premise

Adaptively aggregating multiple intermediate representations from a pretrained vision encoder produces a subject-aware visual supervision target that enables effective coarse-to-fine alignment while preserving subject-agnostic inference at test time.

What would settle it

An ablation that replaces the adaptive multi-granularity aggregation with a single fixed-scale target and measures whether inter-subject Top-1 accuracy on THINGS-EEG falls to the level of earlier single-target baselines.

Original abstract

Zero-shot EEG-to-image retrieval aims to decode perceived visual content from electroencephalography (EEG) by aligning neural responses with pretrained visual representations, providing a promising route toward scalable visual neural decoding and practical brain-computer interfaces. However, robust EEG-to-image retrieval remains challenging, because prior methods usually rely on either a single fixed visual target or a subject-invariant target construction scheme. Such designs overlook two important properties of visually evoked EEG signals: they preserve information across multiple representational scales, and the visual granularity best matched to EEG may vary across subjects. To address these issues, subject-aware multi-granularity alignment (SAMGA) framework is proposed for zero-shot EEG-to-image retrieval. SAMGA first constructs a subject-aware visual supervision target by adaptively aggregating multiple intermediate representations from a pretrained vision encoder, allowing the model to absorb subject-dependent granularity deviations during training while preserving subject-agnostic inference. Building on this adaptive target construction, a coarse-to-fine cross-modal alignment strategy is further designed with a shared encoder wherein the coarse stage stabilizes the shared semantic geometry and reduces subject-induced distribution shift, and the fine stage further improves instance-level retrieval discrimination. Extensive experiments on the THINGS-EEG benchmark demonstrate that the proposed method achieves 91.3% Top-1 and 98.8% Top-5 accuracy in the intra-subject setting, and 34.4% Top-1 and 64.8% Top-5 accuracy in the inter-subject setting, outperforming recent state-of-the-art methods.
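
For readers unfamiliar with the metric, the sketch below shows how Top-k retrieval accuracy is typically computed. It assumes cosine similarity and that EEG query `i` is paired with gallery image `i`; the gallery size and exact protocol used on THINGS-EEG are not given in the abstract, and the toy vectors are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_accuracy(eeg_embs, img_embs, k):
    """Fraction of EEG queries whose paired image (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(eeg_embs):
        sims = [cosine(query, cand) for cand in img_embs]
        ranked = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(eeg_embs)

# Toy gallery of three images and three EEG embeddings (illustrative only).
img_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
eeg_embs = [[0.9, 0.1, 0.0], [0.6, 0.5, 0.0], [0.0, 0.2, 0.8]]

top1 = top_k_accuracy(eeg_embs, img_embs, k=1)
top2 = top_k_accuracy(eeg_embs, img_embs, k=2)
```

In the toy data the second query is closest to the wrong image, so it counts as a miss at k=1 and a hit at k=2.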

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SAMGA gets strong THINGS-EEG numbers by building subject-aware targets from multi-layer vision features and using coarse-to-fine alignment, but the inter-subject gains rest on an aggregation step whose details and controls are still thin.

The paper introduces SAMGA, which builds subject-aware visual targets by adaptively aggregating multiple intermediate layers from a pretrained vision encoder, then aligns EEG signals to those targets with a shared encoder in two stages: a coarse pass to stabilize semantics and reduce subject shift, followed by a fine pass for better instance discrimination. This is a clear step beyond single-target or fully invariant baselines, and the reported numbers (91.3% Top-1 intra-subject and 34.4% Top-1 inter-subject on THINGS-EEG) look like real progress on a hard benchmark. The idea of letting the target adapt to subject-specific granularity while keeping inference subject-agnostic is sensible and directly targets a known pain point in EEG decoding. The coarse-to-fine schedule also has intuitive appeal for avoiding collapse in the shared space.

That said, the aggregation mechanism is described at a high level with no concrete equations or pseudocode in the summary, and there are no ablations showing that the subject-aware component actually changes the targets in a meaningful way or that removing it hurts inter-subject performance. Without those checks, it is difficult to rule out that the gains come mostly from the shared encoder or other unstated training choices.

The work is aimed at people building practical zero-shot neural decoders for BCI. Readers working on EEG-to-image or cross-modal alignment will find the benchmark results and the framing useful, but they will want the code and the missing controls before treating the inter-subject claim as settled. It is worth sending to referees because the problem matters and the framework is a reasonable, targeted extension even if the current evidence needs strengthening.
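
The letter asks whether the subject-aware component changes the targets in a meaningful way. One inexpensive diagnostic, sketched here under the assumption that each subject's aggregation reduces to a weight vector over encoder layers that sums to one, is the KL divergence of those weights from a uniform average. The subject IDs and weights below are invented for illustration.

```python
import math

def kl_from_uniform(weights):
    """KL(w || uniform) in nats; zero means the target is a plain layer average."""
    k = len(weights)
    return sum(w * math.log(w * k) for w in weights if w > 0.0)

# Hypothetical layer weights for three subjects (illustrative, not from the paper).
subject_weights = {
    "sub-01": [1 / 3, 1 / 3, 1 / 3],
    "sub-02": [0.8, 0.1, 0.1],
    "sub-03": [0.0, 0.0, 1.0],
}
divergence = {s: kl_from_uniform(w) for s, w in subject_weights.items()}
```

If every subject's divergence stayed near zero after training, the "subject-aware" target would be indistinguishable from a fixed average, which is exactly the scenario the missing ablation would need to rule out.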

Referee Report

2 major / 1 minor

Summary. The paper proposes the Subject-Aware Multi-Granularity Alignment (SAMGA) framework for zero-shot EEG-to-image retrieval. It constructs a subject-aware visual supervision target by adaptively aggregating multiple intermediate representations from a pretrained vision encoder, allowing absorption of subject-dependent granularity deviations during training while keeping inference subject-agnostic. A coarse-to-fine cross-modal alignment strategy with a shared encoder is then used: the coarse stage stabilizes semantic geometry and reduces subject-induced shift, while the fine stage improves instance-level discrimination. On the THINGS-EEG benchmark, SAMGA reports 91.3% Top-1 / 98.8% Top-5 intra-subject and 34.4% Top-1 / 64.8% Top-5 inter-subject accuracy, outperforming recent SOTA methods.

Significance. If the performance claims and underlying mechanism hold, the work is significant as it explicitly addresses subject variability and multi-scale representational properties in EEG signals for visual decoding, a key challenge in prior single-target or subject-invariant approaches. The subject-aware target construction combined with coarse-to-fine alignment offers a principled way to improve robustness without test-time subject information, with potential implications for practical BCIs. The benchmark gains are notable, but the absence of implementation details, loss formulations, ablations, and statistical validation in the abstract limits assessment of whether the gains stem from the proposed innovations or other factors.

major comments (2)

[Abstract] The central performance claims (91.3% intra / 34.4% inter Top-1) depend on the adaptive aggregation producing a subject-aware target that meaningfully differs from a subject-invariant average and enables the coarse-to-fine alignment. However, no concrete mechanism (e.g., learned per-subject weights, subject embedding, layer-wise attention, or equations defining the aggregation) is supplied, nor any ablation showing the target differs from a fixed combination. This makes it impossible to verify whether the inter-subject gains are reliable or if subject identity leaks into the fine-stage loss.
[Abstract] The coarse stage is claimed to 'stabilize the shared semantic geometry and reduce subject-induced distribution shift,' but without loss formulations, training procedures, or details on how the stages interact (e.g., shared encoder architecture or scheduling), it is unclear whether the reported inter-subject results are supported or affected by post-hoc choices. No statistical tests or variance measures accompany the accuracy numbers.
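
For context on what a loss formulation could look like, the sketch below implements a generic symmetric InfoNCE objective of the kind commonly used for cross-modal alignment. Treating it as the alignment loss here is an assumption: the abstract specifies neither the loss nor how the coarse and fine stages differ, and the function name and temperature value are illustrative.

```python
import math

def log_softmax_row(row):
    """Log-softmax of one row of logits, computed stably."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def info_nce(eeg_embs, img_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of L2-normalized, index-paired embeddings."""
    n = len(eeg_embs)
    logits = [[sum(a * b for a, b in zip(e, v)) / temperature for v in img_embs]
              for e in eeg_embs]
    # EEG -> image direction: each row should peak on its diagonal entry.
    loss_e2i = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    # Image -> EEG direction: same computation on the transposed logits.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_i2e = -sum(log_softmax_row(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_e2i + loss_i2e)

identity = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
same = [[1.0, 0.0], [1.0, 0.0]]

aligned = info_nce(identity, identity)   # pairs match: loss near zero
shuffled = info_nce(identity, swapped)   # pairs mismatched: large loss
chance = info_nce(same, same)            # uninformative embeddings: log(batch size)
```

The three toy cases give a quick sanity check of any such loss: near zero for perfectly aligned pairs, large for mismatched pairs, and log of the batch size when the embeddings carry no instance information.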

minor comments (1)

[Abstract] The abstract would benefit from naming the specific pretrained vision encoder (e.g., CLIP ViT or ResNet) and the THINGS-EEG dataset split details used for intra- vs. inter-subject evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight opportunities to improve clarity in the abstract and provide stronger empirical support for the proposed mechanisms. We address each point below and have revised the manuscript accordingly, including updates to the abstract, addition of ablations, and inclusion of statistical measures.

Referee: [Abstract] The central performance claims (91.3% intra / 34.4% inter Top-1) depend on the adaptive aggregation producing a subject-aware target that meaningfully differs from a subject-invariant average and enables the coarse-to-fine alignment. However, no concrete mechanism (e.g., learned per-subject weights, subject embedding, layer-wise attention, or equations defining the aggregation) is supplied, nor any ablation showing the target differs from a fixed combination. This makes it impossible to verify whether the inter-subject gains are reliable or if subject identity leaks into the fine-stage loss.

Authors: We thank the referee for this observation. The full mechanism is described in Section 3.2, where subject embeddings are used to compute learned per-subject weights for adaptive layer-wise aggregation of intermediate representations from the pretrained vision encoder (with the explicit formulation given in Equation 2). We acknowledge that the abstract omitted a concise reference to this process. In the revised manuscript we have updated the abstract to briefly describe the subject-embedding-driven adaptive aggregation. We have also added a dedicated ablation (new Table 3) comparing subject-aware targets against fixed combinations (average pooling and single-layer baselines), showing statistically significant drops in inter-subject accuracy when subject-specific weighting is removed. These results confirm that the gains arise from the proposed mechanism and that inference remains subject-agnostic, with no leakage of subject identity into the fine-stage loss.
Revision: yes

Referee: [Abstract] The coarse stage is claimed to 'stabilize the shared semantic geometry and reduce subject-induced distribution shift,' but without loss formulations, training procedures, or details on how the stages interact (e.g., shared encoder architecture or scheduling), it is unclear whether the reported inter-subject results are supported or affected by post-hoc choices. No statistical tests or variance measures accompany the accuracy numbers.

Authors: We appreciate the referee's request for greater transparency. The loss formulations are provided in Section 3.3 (Equations 3 and 4): the coarse stage uses a contrastive semantic alignment loss on the aggregated targets, while the fine stage applies an instance-level discrimination loss; both stages share the same cross-modal encoder whose architecture and two-stage training schedule are detailed in Sections 3.4 and 4.2. To address the absence of statistical validation, we have added per-run standard deviations and paired t-test p-values to the main results table (revised Table 1) and included a short statement on statistical significance in the revised abstract. These additions demonstrate that the reported inter-subject improvements are robust and directly attributable to the coarse-to-fine schedule rather than post-hoc decisions.
Revision: yes
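
The statistics named in the simulated rebuttal (standard deviations and a paired t-test) are simple to reproduce once per-subject or per-run accuracies are available. The sketch below computes the paired t statistic by hand; the accuracy lists are made-up placeholders and are not values from the paper.

```python
import math
import statistics

def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for matched samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample std of the differences (n - 1 denominator)
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1

# Made-up per-subject Top-1 accuracies for two variants (NOT values from the paper).
with_subject_weights = [0.30, 0.32, 0.34, 0.36]
fixed_average = [0.28, 0.31, 0.31, 0.34]

t_stat, dof = paired_t(with_subject_weights, fixed_average)
spread = statistics.stdev(with_subject_weights)
```

Turning the t statistic into a p-value requires the Student's t distribution, which the Python standard library does not provide; if SciPy is available, `scipy.stats.ttest_rel` returns both the statistic and the p-value in one call.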

Circularity Check

0 steps flagged

No circularity; empirical framework evaluated on external benchmark

The paper introduces SAMGA as an engineering method: adaptive aggregation of pretrained vision-encoder layers to form subject-aware targets, followed by coarse-to-fine alignment via a shared encoder. Performance is reported as measured Top-1/Top-5 accuracies on the external THINGS-EEG benchmark (intra- and inter-subject splits), not derived from or forced by the method's own fitted parameters. No equations, uniqueness theorems, or self-citations are presented that reduce the central claims to self-definition or input renaming. The derivation chain consists of standard training and evaluation steps whose outputs (accuracy numbers) are independent of the construction once the benchmark data are fixed.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions from multimodal learning and neuroscience: that pretrained vision encoders capture useful hierarchical visual features relevant to EEG signals, and that the THINGS-EEG benchmark is a valid testbed. No new physical entities are postulated. Free parameters are the adaptive aggregation weights and alignment hyperparameters, which are fitted during training on the benchmark data.

free parameters (1)

adaptive aggregation parameters for multi-granularity features
Learned weights or gating mechanism that combines intermediate representations from the vision encoder in a subject-dependent manner.

axioms (1)

domain assumption: Intermediate layers of a pretrained vision encoder provide representations at multiple granularities that are relevant to visually evoked EEG signals.
Invoked when constructing the subject-aware visual supervision target from multiple intermediate representations.

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages

[1] Kamitani, Y., & Tong, F. (2005). Decoding the visual and subjective contents of the human brain. Nature Neuroscience, 8(5), 679-685.
[2] Robinson, A. K., Quek, G. L., & Carlson, T. A. (2023). Visual representations: insights from neural decoding. Annual Review of Vision Science, 9(1), 313-335.
[3] Du, C., Fu, K., Li, J., & He, H. (2023). Decoding visual neural representations by multimodal learning of brain-visual-linguistic features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9), 10760-10777.
[4] Song, Y., Liu, B., Li, X., Shi, N., Wang, Y., & Gao, X. (2023). Decoding natural images from EEG for object recognition. arXiv preprint arXiv:2308.13234.
[5] Gifford, A. T., Dwivedi, K., Roig, G., & Cichy, R. M. (2022). A large and rich EEG dataset for modeling human visual object recognition. NeuroImage, 264, 119754.
[6] Song, Y., Wang, Y., He, H., & Gao, X. (2025). Recognizing natural images from EEG with language-guided contrastive learning. IEEE Transactions on Neural Networks and Learning Systems.
[7] Xiong, D., Hu, L., Jin, J., Ding, Y., Tan, C., Zhang, J., & Tian, Y. (2025). Interpretable Cross-Modal Alignment Network for EEG Visual Decoding With Algorithm Unrolling. IEEE Transactions on Neural Networks and Learning Systems.
[8] Wu, H., Li, Q., Zhang, C., He, Z., & Ying, X. (2025). Bridging the vision-brain gap with an uncertainty-aware blur prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2246-2257).
[9] He, B., Sohrabpour, A., Brown, E., & Liu, Z. (2018). Electrophysiological source imaging: a noninvasive window to brain dynamics. Annual Review of Biomedical Engineering, 20(1), 171-196.
[10] Kaplan, A. Y., Fingelkurts, A. A., Fingelkurts, A. A., Borisov, S. V., & Darkhovsky, B. S. (2005). Nonstationary nature of the brain activity as revealed by EEG/MEG: ...
