SIMON: Saliency-aware Integrative Multi-view Object-centric Neural Decoding
Pith reviewed 2026-05-09 19:40 UTC · model grok-4.3
The pith
Saliency-aware sampling of multiple fixation points aligns EEG signals with object features for better zero-shot image retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SIMON integrates saliency prediction and foreground segmentation via Saliency-Aware Sampling to generate multiple foveated views that emphasize object regions, thereby reducing geometric-semantic dissociation and enabling higher-accuracy zero-shot EEG-to-image retrieval than fixed center-view methods.
What carries the argument
Saliency-Aware Sampling (SAS), which selects fixation centers from combined saliency and segmentation maps to produce multi-view foveated images that better match EEG response patterns.
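One way to picture the mechanism is the minimal sketch below. It is not the authors' implementation: the function names, the greedy peak-picking rule, the `min_dist` suppression radius, and the plain square crop (standing in for a real foveation transform) are all assumptions made for illustration. It multiplies a saliency map by a foreground mask, picks the strongest remaining peaks as fixation centers, and cuts one view per center.

```python
import numpy as np

def sample_fixations(saliency, mask, k=3, min_dist=4):
    """Pick up to k fixation centers (row, col) from saliency * foreground mask.

    Hypothetical greedy rule: take the highest-scoring pixel, then zero out a
    square window of half-width min_dist around it before picking the next.
    """
    score = saliency.astype(float) * mask.astype(float)
    centers = []
    for _ in range(k):
        r, c = np.unravel_index(np.argmax(score), score.shape)
        if score[r, c] <= 0:
            break  # no salient foreground pixel is left
        centers.append((int(r), int(c)))
        r0, r1 = max(r - min_dist, 0), r + min_dist + 1
        c0, c1 = max(c - min_dist, 0), c + min_dist + 1
        score[r0:r1, c0:c1] = 0.0
    return centers

def crop_view(image, center, size):
    """Square crop of side `size` around `center`, shifted to stay in bounds.

    Assumes size <= min(image height, image width). A real pipeline would
    apply a foveation (eccentricity-dependent blur) instead of a hard crop.
    """
    h, w = image.shape[:2]
    half = size // 2
    top = min(max(center[0] - half, 0), h - size)
    left = min(max(center[1] - half, 0), w - size)
    return image[top:top + size, left:left + size]

# Toy example: two foreground peaks and one background peak that the mask removes.
saliency = np.zeros((16, 16))
saliency[3, 3], saliency[12, 12], saliency[3, 12] = 1.0, 0.8, 0.9
mask = np.zeros((16, 16))
mask[:8, :8] = 1
mask[8:, 8:] = 1
print(sample_fixations(saliency, mask))  # → [(3, 3), (12, 12)]
```

In the toy example the background peak at (3, 12) is never selected because the mask zeroes it out, which is the behavior the summary attributes to combining segmentation with saliency.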
Load-bearing premise
Saliency prediction and foreground segmentation will reliably choose fixation centers that match the attention patterns present in EEG responses.
What would settle it
A controlled test showing that EEG-to-image retrieval accuracy does not increase when switching from center-only views to saliency-selected multi-views on the same dataset.
Original abstract
Recent EEG-to-image retrieval methods leverage pretrained vision encoders and foveation-inspired priors, but typically assume a fixed, center-focused view. This center bias conflicts with content-driven human attention, creating a geometric-semantic dissociation between visual features and EEG responses. We propose SIMON, a saliency-aware multi-view framework for zero-shot EEG-to-image retrieval. SIMON combines foreground segmentation and saliency prediction to select fixation centers via Saliency-Aware Sampling (SAS), then generates foveated views that emphasize informative object regions while suppressing background clutter. On THINGS-EEG, SIMON achieves state-of-the-art performance in both intra-subject and inter-subject settings, reaching an average Top-1 accuracy of 69.7% and 19.6%, respectively, consistently outperforming recent competitive baselines. Analyses across sampling granularity, EEG channel topology, and visual/brain encoder backbones further support the robustness of saliency-aware multi-view integration. Our code and models are publicly available at https://github.com/simonlink666/SIMON.
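For readers unfamiliar with the metric, the sketch below shows how a Top-k retrieval accuracy of this kind is commonly computed from paired embeddings. It is an illustration under stated assumptions, not the paper's evaluation code: the function names are hypothetical, cosine similarity is assumed as the matching score, and averaging per-view embeddings in `fuse_views` is just one plausible way to integrate multiple views.

```python
import numpy as np

def topk_accuracy(eeg_emb, img_emb, k=1):
    """Fraction of EEG trials whose matching image ranks in the top k.

    eeg_emb, img_emb: arrays of shape (n, d). Row i of each array is
    assumed to be a matched EEG/image pair.
    """
    eeg = eeg_emb / np.linalg.norm(eeg_emb, axis=1, keepdims=True)
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    sim = eeg @ img.T                        # cosine similarity, shape (n, n)
    topk = np.argsort(-sim, axis=1)[:, :k]   # k most similar images per trial
    targets = np.arange(sim.shape[0])[:, None]
    return float((topk == targets).any(axis=1).mean())

def fuse_views(view_emb):
    """Average per-view image embeddings: (n, v, d) -> (n, d)."""
    return view_emb.mean(axis=1)

# Toy embeddings (made up for illustration only).
img = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
eeg = np.array([[1.0, 0.1], [0.1, 1.0], [1.0, 0.0]])
print(topk_accuracy(eeg, img, k=1))
print(topk_accuracy(eeg, img, k=2))
```

In the toy data the third EEG vector is closer to the first image than to its own, so it counts as a Top-1 miss but a Top-2 hit.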
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SIMON, a saliency-aware multi-view framework for zero-shot EEG-to-image retrieval. Its Saliency-Aware Sampling (SAS) step integrates pretrained foreground segmentation and saliency prediction to select fixation centers, then generates foveated views to address the mismatch between a fixed center view and content-driven EEG responses. On the THINGS-EEG dataset, the paper reports state-of-the-art Top-1 accuracy of 69.7% (intra-subject) and 19.6% (inter-subject), outperforming baselines, with supporting analyses on sampling granularity, channel topology, and encoder backbones. Code is publicly released.
Significance. If the alignment between SAS fixations and EEG-driven object regions holds, the approach could meaningfully advance neural decoding by reducing geometric-semantic dissociation in foveated multi-view setups, with implications for brain-computer interfaces and zero-shot retrieval. Public code supports reproducibility.
major comments (2)
- [Abstract] Abstract and Experiments: The central claim attributes SOTA gains (69.7% intra / 19.6% inter Top-1) to SAS resolving center-bias mismatch with EEG content-driven attention. However, no direct metric is reported (e.g., spatial correlation between SAS maps and EEG decoding weights, or fixation-EEG response overlap) to confirm that saliency/segmentation-derived centers align with regions driving the recorded signals rather than providing generic multi-view ensembling benefits. This is load-bearing for the main contribution.
- [Evaluation] Evaluation section: Performance claims are limited to a single dataset (THINGS-EEG) with no reported error bars, exact baseline details, or statistical significance tests. This undermines the robustness conclusions drawn from analyses on sampling granularity and channel topology.
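The kind of direct metric requested in the first comment could look like the sketch below. It is only an illustration: it assumes an EEG-derived attribution map has already been projected into image space with the same shape as the SAS map, which the paper does not describe, and both function names and the `frac` threshold are hypothetical.

```python
import numpy as np

def pearson_map_corr(a, b):
    """Pearson correlation between two same-shaped 2D maps, flattened."""
    a = a.ravel().astype(float) - a.mean()
    b = b.ravel().astype(float) - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def top_fraction_iou(a, b, frac=0.2):
    """IoU of the regions holding the top `frac` of pixels in each map."""
    n = max(int(round(frac * a.size)), 1)
    top_a = np.zeros(a.size, dtype=bool)
    top_b = np.zeros(b.size, dtype=bool)
    top_a[np.argsort(a.ravel())[-n:]] = True   # n highest-valued pixels of a
    top_b[np.argsort(b.ravel())[-n:]] = True   # n highest-valued pixels of b
    return float((top_a & top_b).sum() / (top_a | top_b).sum())
```

Comparing these scores for SAS maps against a center-only prior and against random fixation maps would help separate content-aware alignment from generic multi-view ensembling.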
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with point-by-point responses and indicate the revisions we will incorporate to improve the manuscript.
- Referee: [Abstract] Abstract and Experiments: The central claim attributes SOTA gains (69.7% intra / 19.6% inter Top-1) to SAS resolving center-bias mismatch with EEG content-driven attention. However, no direct metric is reported (e.g., spatial correlation between SAS maps and EEG decoding weights, or fixation-EEG response overlap) to confirm that saliency/segmentation-derived centers align with regions driving the recorded signals rather than providing generic multi-view ensembling benefits. This is load-bearing for the main contribution.
Authors: We agree that a direct quantitative link between SAS fixation centers and EEG-driven regions would strengthen the interpretation. Our ablations already isolate the benefit of SAS over center-biased and random multi-view baselines, indicating that the gains exceed generic ensembling. In the revised manuscript we will add a targeted analysis that correlates SAS-derived saliency maps with spatial patterns obtained from the EEG encoder (e.g., via gradient-based attribution or channel-wise decoding weights) and report overlap metrics. This addition will clarify the contribution of content-aware sampling.
Revision: partial
- Referee: [Evaluation] Evaluation section: Performance claims are limited to a single dataset (THINGS-EEG) with no reported error bars, exact baseline details, or statistical significance tests. This undermines the robustness conclusions drawn from analyses on sampling granularity and channel topology.
Authors: We accept that more rigorous statistical reporting is required. The revised evaluation section will include error bars (standard deviation across subjects and multiple random seeds), precise hyperparameter and implementation details for all baselines, and statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) for the reported improvements. Although THINGS-EEG remains the primary public benchmark for EEG-to-image retrieval, we will expand the discussion to explicitly note the single-dataset limitation and outline directions for future multi-dataset validation.
Revision: yes
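A paired comparison of the kind promised here can be sketched as follows. The rebuttal names paired t-tests or Wilcoxon signed-rank tests; this sketch swaps in an exact sign-flip permutation test because it needs only the standard library. The function name is hypothetical and the per-subject accuracies are made-up placeholders, not results from the paper.

```python
from itertools import product

def paired_permutation_test(a, b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Returns (mean difference, p-value). All 2**n sign patterns are
    enumerated, so this is only practical for a small number of pairs.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)) / n)
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
        total += 1
    return sum(diffs) / n, hits / total

# Placeholder per-subject Top-1 accuracies (illustrative values only).
sas_views = [0.70, 0.72, 0.68, 0.71]
center_only = [0.60, 0.61, 0.60, 0.62]
mean_diff, p_value = paired_permutation_test(sas_views, center_only)
print(mean_diff, p_value)
```

With only four pairs the smallest attainable two-sided p-value is 2/16, which shows why significance claims depend on the number of subjects and seeds being reported.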
Circularity Check
No circularity detected in derivation or claims
The paper presents an empirical framework (SIMON) that integrates pretrained saliency prediction and foreground segmentation to generate multi-view foveated inputs for EEG-to-image retrieval, then reports measured Top-1 accuracies on the public THINGS-EEG dataset. No first-principles derivation, uniqueness theorem, or predictive equation is claimed; performance numbers are obtained by running the pipeline on held-out data rather than by fitting parameters whose outputs are then re-labeled as predictions. Any self-citations are incidental and do not substitute for the experimental results.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Arash Akbarinia. The footprint of colour in EEG signal. bioRxiv, 2025. doi: 10.1101/2025.10.06.680651. This version posted December 4, 2025.
- [2] Marisa Carrasco. Visual attention: The past 25 years. Vision Research, 51(13):1484–1525, 2011.
- [3] Patrick Cavanagh, Gideon P. Caplovitz, Taissa K. Lytchenko, Marvin R. Maechler, Peter U. Tse, and David L. Sheinberg. The architecture of object-based attention. Psychonomic Bulletin & Review, 30(5):1643–1667, 2023.
- [4] Hongzhou Chen, Lianghua He, Yihang Liu, and Longzhen Yang. Visual neural decoding via improved visual-EEG semantic consistency. arXiv preprint arXiv:2408.06788, 2024.
- [5] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2829, 2023.
- [6] Aleato T Gifford, Kshitij Dwivedi, Gemma Roig, and Radoslaw M Cichy. A large and rich EEG dataset for modeling human visual object recognition. NeuroImage, 264:119754, 2022.
- [7] Ziad M Hafed and James J Clark. Microsaccades as an overt measure of covert attention shifts. Vision Research, 42(22):2533–2545, 2002.
- [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- [9] Alireza Hosseini, Amirhossein Kazerouni, Saeed Akhavan, Michael Brudno, and Babak Taati. SUM: Saliency unification through mamba for visual attention modeling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.
- [10] Sihyeon Jo, Wonsik Jeong, Dong-Won Heo, Yoosung Hwang, and Heung-Il Suk. Hyfi: Hyperbolic feature interpolation for brain-vision alignment. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026.
- [11] Narges Khaleghi, Tohid Yousefi Rezaii, Soosan Beheshti, and Mohammad Reza Daliri. Visual saliency and image reconstruction from EEG signals via an effective geometric deep network-based generative adversarial network. Electronics, 11(21):3637, 2022.
- [12] David A. Klindt, Alexander S. Ecker, Thomas Euler, and Matthias Bethge. Neural system identification for large populations separating “what” and “where”. In Advances in Neural Information Processing Systems (NeurIPS), pages 3506–3516, 2017.
- [13] Vernon J Lawhern, Amelia J Solon, Nicholas R Waytowich, Stephen M Gordon, Chou P Hung, and Brent J Lance. EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces. Journal of Neural Engineering, 15(5):056013, 2018.
- [14] Dongfang Li, Caixia Wei, Shichao Li, Jiachen Zou, Hu Qin, and Qunsheng Liu. Visual decoding and reconstruction via EEG embeddings with guided diffusion. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
- [15] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.
- [16] Susana Martinez-Conde, Stephen L Macknik, and David H Hubel. The role of fixational eye movements in visual perception. Nature Reviews Neuroscience, 5(3):229–240, 2004.
- [17] David Noton and Lawrence Stark. Scanpaths in eye movements during pattern perception. Science, 171(3968):308–311, 1971.
- [18] Thomas P. O’Connell and Marvin M. Chun. Predicting eye movement patterns from fMRI responses to natural scenes. Nature Communications, 9:5159, 2018.
- [19] Simone Palazzo, Concetto Spampinato, Isaak Kavasidis, Daniela Giordano, Joseph Schmidt, and Mubarak Shah. Decoding brain representations by multimodal learning of neural activity and visual features. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43(11):3833–3849, 2021.
- [20] Michael I Posner. Orienting of attention. Quarterly Journal of Experimental Psychology, 32(1):3–25, 1980.
- [21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763, 2021.
- [22] Ronald A Rensink. The dynamic representation of scenes. Visual Cognition, 7(1-3):17–42, 2000.
- [23] Edmund T. Rolls. Two what, two where, visual cortical streams in humans. Neuroscience & Biobehavioral Reviews, 160:105650, 2024.
- [24] Robin Tibor Schirrmeister, Jost Tobias Springenberg, Lukas Dominique Josef Fiederer, Martin Glasstetter, Katharina Eggensperger, Michael Tangermann, Frank Hutter, Tonio Ball, and Wolfram Burgard. Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping, 38(11):5391–5420, 2017.
- [25] Yizhe Song, Bingbei Liu, Xuelin Li, Nan Shi, Yijie Wang, and Xiaoguang Gao. Decoding natural images from EEG for object recognition. In International Conference on Learning Representations (ICLR), 2024.
- [26] Haitao Wu, Qing Li, Changqing Zhang, Zhen He, and Xiaomin Ying. Bridging the vision-brain gap with an uncertainty-aware blur prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [27] Wenjiang Zhang, Sifeng Wang, Yuwei Su, Xinyu Li, Chen Zhang, and Suyu Zhong. Neurobridge: Bio-inspired self-supervised EEG-to-image decoding via cognitive priors and bidirectional semantic alignment. arXiv preprint arXiv:2511.06836, 2025.
- [28] Peng Zheng, Dehong Gao, Deng-Ping Fan, Li Liu, Debra Laefer, and Ming-Ming Cheng. Bilateral reference for high-resolution dichotomous image segmentation. arXiv preprint arXiv:2401.03407, 2024.
A Experiment Configuration
A.1 Datasets and Preprocessing
Dataset. We evaluate our method on the THINGS-EEG dataset [6], a large-scale benchmark collected using a ...