
arxiv: 2605.00401 · v1 · submitted 2026-05-01 · 💻 cs.CV · q-bio.NC

SIMON: Saliency-aware Integrative Multi-view Object-centric Neural Decoding

Pith reviewed 2026-05-09 19:40 UTC · model grok-4.3

classification 💻 cs.CV q-bio.NC
keywords EEG-to-image retrieval · saliency-aware sampling · multi-view foveation · zero-shot decoding · object-centric neural decoding · THINGS-EEG · foreground segmentation

The pith

Saliency-aware sampling of multiple fixation points aligns EEG signals with object features for better zero-shot image retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that fixed center-focused views create a mismatch between image features and the content-driven attention captured in EEG signals. SIMON addresses this by combining saliency maps with foreground segmentation to pick multiple informative fixation centers, then builds foveated multi-view representations that highlight objects and suppress background clutter. On the THINGS-EEG benchmark this yields state-of-the-art zero-shot retrieval, with average Top-1 accuracy of 69.7% in the intra-subject setting and 19.6% in the inter-subject setting. The gains hold across changes in sampling granularity, EEG channel layout, and choice of visual or neural encoders.

Core claim

SIMON integrates saliency prediction and foreground segmentation via Saliency-Aware Sampling to generate multiple foveated views that emphasize object regions, thereby reducing geometric-semantic dissociation and enabling higher-accuracy zero-shot EEG-to-image retrieval than fixed center-view methods.

What carries the argument

Saliency-Aware Sampling (SAS), which selects fixation centers from combined saliency and segmentation maps to produce multi-view foveated images that better match EEG response patterns.
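As a rough illustration of how saliency and segmentation could be combined to pick fixation centers, the sketch below gates a saliency grid by a thresholded foreground-probability grid, then greedily keeps the highest-scoring locations while skipping near-duplicates. This is not the authors' implementation: the function name `select_fixation_centers`, the gating rule, the greedy suppression step, and the parameters `k`, `tau`, and `min_dist` are assumptions made for illustration, and the toy grids are invented.

```python
from typing import List, Tuple

Grid = List[List[float]]


def select_fixation_centers(
    saliency: Grid,
    foreground_prob: Grid,
    k: int = 3,
    tau: float = 0.5,
    min_dist: float = 2.0,
) -> List[Tuple[int, int]]:
    """Pick up to k (row, col) centers, highest foreground-gated saliency first."""
    # Keep only cells whose foreground probability reaches tau (assumed gating rule).
    scored = []
    for r, (sal_row, fg_row) in enumerate(zip(saliency, foreground_prob)):
        for c, (s, p) in enumerate(zip(sal_row, fg_row)):
            if p >= tau:
                scored.append((s, r, c))
    # Descending saliency; ties broken by position so the result is deterministic.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))

    centers: List[Tuple[int, int]] = []
    for _, r, c in scored:
        # Greedy suppression: skip candidates too close to an accepted center.
        if all((r - cr) ** 2 + (c - cc) ** 2 >= min_dist ** 2 for cr, cc in centers):
            centers.append((r, c))
            if len(centers) == k:
                break
    return centers


# Invented toy inputs: a 4x5 saliency grid and a foreground-probability grid.
saliency = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.8, 0.1, 0.0],
    [0.1, 0.3, 0.2, 0.1, 0.7],
    [0.0, 0.1, 0.1, 0.6, 0.95],
]
foreground_prob = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.2, 0.9, 0.9, 0.1, 0.0],
    [0.1, 0.8, 0.7, 0.2, 0.1],
    [0.0, 0.1, 0.2, 0.9, 0.3],
]
centers = select_fixation_centers(saliency, foreground_prob, k=2)
print(centers)
```

In the toy grids the most salient cell, (3, 4), is never selected because its foreground probability falls below `tau`. That mirrors the behaviour the summary attributes to SAS, namely favouring object regions over salient background.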

Load-bearing premise

Saliency prediction and foreground segmentation will reliably choose fixation centers that match the attention patterns present in EEG responses.

What would settle it

A controlled test showing that EEG-to-image retrieval accuracy does not increase when switching from center-only views to saliency-selected multi-views on the same dataset.

Figures

Figures reproduced from arXiv: 2605.00401 by Chun-Shu Wei, Ji-Hwa Tsai, YuSheng Lin.

Figure 1: (a) Quantitative Assessment of Geometric-Semantic Dissociation; (b) Illustration of the …
Figure 2: An overview of the proposed SIMON framework. The pipeline integrates visual saliency estimation with a Saliency-Aware Sampling (SAS) strategy. The selected high-resolution crops are processed by a vision encoder and projected into hyperbolic space to align with EEG embeddings via contrastive learning.
Figure 3: Visualization of the saliency-aware multi-view generation. From left to right: input image, semantic saliency map, foreground mask, sampled fixation centers, and the resulting foveated views. The sampled centers adapt to off-center semantic regions, allowing SIMON to preserve informative foreground details, such as the extremities of the aardvark and the elongated structure of the airboat, that are often su…
Figure 4: Dissociation-conditioned analysis of the Top-1 intra-subject retrieval gain.
Figure 5: Sensitivity of Top-1 and Top-5 retrieval accuracy to the sampling granularity K. The single-view baseline (K = 1) performs relatively poorly, reaching 68.2% in the intra-subject setting and 18.9% in the inter-subject setting, suggesting that a single foveal crop cannot capture the full semantics of the visual stimulus. As K increases, performance improves in both settings, indicating that aggregating multip…
Figure 6: Impact of channel combinations on retrieval performance and channel grouping visualization. (c) Channel grouping. The EEG electrodes (black dots, 10–10 montage) are grouped into five regions: Frontal (F), Central (C), Temporal (T), Parietal (P), and Occipital (O). (d) Impact of channel combinations on intra-subject retrieval performance. (e) Impact of channel combinations on inter-subject retrieval perform…
Figure 8: Relative Top-5 accuracy gain of SIMON across different EEG and image encoders. The choice of vision encoder affects the absolute retrieval scores, but it does not substantially change the relative ordering among EEG encoders. This pattern is observed for both CNN-based and …
Figure 9: Visualization of Top-3 retrieved images on the THINGS-EEG dataset (SIMON vs HyFI).
Figure 10: Comparison of visual sampling configurations.
Figure 12: Top-5 accuracy improvement matrix (SIMON vs. Vanilla).

B.5 Threshold Stability Analysis. To examine the effect of the foreground threshold τ used in the saliency-aware sampling strategy, we conduct a dataset-wide stability analysis on the segmentation probability maps produced by BiRefNet over all 16,740 images in THINGS-EEG. Specifically, we measure how the foreground masks vary across a practical range o…
Figure 13: Comparison of view-selection strategies under a fixed multi-view budget. Geometric-center sampling serves as the center-fixed baseline. Random sampling produces only a minor change, whereas sampling from non-salient regions reduces retrieval accuracy. The SAS-based strategy yields the highest Top-1 and Top-5 performance under the same number of views. The comparison isolates the role of saliency modeling …
Figure 14: Qualitative visualization of successful retrieval cases.
Figure 15: Qualitative visualization of failure cases.
Original abstract

Recent EEG-to-image retrieval methods leverage pretrained vision encoders and foveation-inspired priors, but typically assume a fixed, center-focused view. This center bias conflicts with content-driven human attention, creating a geometric-semantic dissociation between visual features and EEG responses. We propose SIMON, a saliency-aware multi-view framework for zero-shot EEG-to-image retrieval. SIMON combines foreground segmentation and saliency prediction to select fixation centers via Saliency-Aware Sampling (SAS), then generates foveated views that emphasize informative object regions while suppressing background clutter. On THINGS-EEG, SIMON achieves state-of-the-art performance in both intra-subject and inter-subject settings, reaching an average Top-1 accuracy of 69.7% and 19.6%, respectively, consistently outperforming recent competitive baselines. Analyses across sampling granularity, EEG channel topology, and visual/brain encoder backbones further support the robustness of saliency-aware multi-view integration. Our code and models are publicly available at https://github.com/simonlink666/SIMON.
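The Top-1 figures in the abstract are retrieval accuracies: each EEG embedding is scored against every candidate image embedding, and a query counts as a hit when its paired image ranks within the top k. A minimal sketch of that metric follows. It assumes cosine similarity and index-aligned EEG/image pairs, whereas the Figure 2 caption indicates the paper aligns embeddings in hyperbolic space, so the similarity function here is a stand-in. The helper names and the toy 2-D embeddings are invented.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_accuracy(
    eeg_embs: List[Sequence[float]],
    img_embs: List[Sequence[float]],
    k: int,
) -> float:
    """Fraction of EEG queries whose paired image (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(eeg_embs):
        sims = [cosine(query, img) for img in img_embs]
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(img_embs)), key=lambda j: -sims[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(eeg_embs)


# Invented toy embeddings: three candidate images and three EEG queries.
img_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eeg_embs = [[0.9, 0.1], [0.6, 0.8], [0.2, 1.0]]

top1 = top_k_accuracy(eeg_embs, img_embs, k=1)
top2 = top_k_accuracy(eeg_embs, img_embs, k=2)
print(top1, top2)
```

Only the first toy query ranks its own image first, while all three place it within the top two, so Top-2 exceeds Top-1 here in the same way Top-5 exceeds Top-1 in the reported results.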

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes SIMON, a saliency-aware multi-view framework for zero-shot EEG-to-image retrieval. It uses Saliency-Aware Sampling (SAS) that integrates pretrained foreground segmentation and saliency prediction to select fixation centers, generating foveated views to address center-bias mismatch with content-driven EEG responses. On the THINGS-EEG dataset, it reports state-of-the-art intra-subject Top-1 accuracy of 69.7% and inter-subject accuracy of 19.6%, outperforming baselines, with supporting analyses on sampling granularity, channel topology, and encoder backbones. Code is publicly released.

Significance. If the alignment between SAS fixations and EEG-driven object regions holds, the approach could meaningfully advance neural decoding by reducing geometric-semantic dissociation in foveated multi-view setups, with implications for brain-computer interfaces and zero-shot retrieval. Public code supports reproducibility.

major comments (2)
  1. [Abstract] Abstract and Experiments: The central claim attributes SOTA gains (69.7% intra / 19.6% inter Top-1) to SAS resolving center-bias mismatch with EEG content-driven attention. However, no direct metric is reported (e.g., spatial correlation between SAS maps and EEG decoding weights, or fixation-EEG response overlap) to confirm that saliency/segmentation-derived centers align with regions driving the recorded signals rather than providing generic multi-view ensembling benefits. This is load-bearing for the main contribution.
  2. [Evaluation] Evaluation section: Performance claims are limited to a single dataset (THINGS-EEG) with no reported error bars, exact baseline details, or statistical significance tests. This undermines the robustness conclusions drawn from analyses on sampling granularity and channel topology.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with point-by-point responses and indicate the revisions we will incorporate to improve the manuscript.

  1. Referee: [Abstract] Abstract and Experiments: The central claim attributes SOTA gains (69.7% intra / 19.6% inter Top-1) to SAS resolving center-bias mismatch with EEG content-driven attention. However, no direct metric is reported (e.g., spatial correlation between SAS maps and EEG decoding weights, or fixation-EEG response overlap) to confirm that saliency/segmentation-derived centers align with regions driving the recorded signals rather than providing generic multi-view ensembling benefits. This is load-bearing for the main contribution.

    Authors: We agree that a direct quantitative link between SAS fixation centers and EEG-driven regions would strengthen the interpretation. Our ablations already isolate the benefit of SAS over center-biased and random multi-view baselines, indicating that the gains exceed generic ensembling. In the revised manuscript we will add a targeted analysis that correlates SAS-derived saliency maps with spatial patterns obtained from the EEG encoder (e.g., via gradient-based attribution or channel-wise decoding weights) and report overlap metrics. This addition will clarify the contribution of content-aware sampling. revision: partial

  2. Referee: [Evaluation] Evaluation section: Performance claims are limited to a single dataset (THINGS-EEG) with no reported error bars, exact baseline details, or statistical significance tests. This undermines the robustness conclusions drawn from analyses on sampling granularity and channel topology.

    Authors: We accept that more rigorous statistical reporting is required. The revised evaluation section will include error bars (standard deviation across subjects and multiple random seeds), precise hyperparameter and implementation details for all baselines, and statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) for the reported improvements. Although THINGS-EEG remains the primary public benchmark for EEG-to-image retrieval, we will expand the discussion to explicitly note the single-dataset limitation and outline directions for future multi-dataset validation. revision: yes
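The paired significance testing promised in the second response can be sketched without external libraries. The rebuttal names paired t-tests and Wilcoxon signed-rank tests; the block below swaps in an exact paired sign-flip permutation test as a dependency-free stand-in, which likewise asks whether per-subject gains are distinguishable from zero. The function name and the per-subject accuracies are hypothetical placeholders, not values from the paper.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_p_value(
    baseline: Sequence[float], method: Sequence[float]
) -> float:
    """Exact two-sided p-value for the mean paired difference under sign flips."""
    diffs = [m - b for b, m in zip(baseline, method)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    total = 0
    # Enumerate all 2**n sign assignments; feasible for about ten subjects.
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)) / n)
        if stat >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical per-subject Top-1 accuracies (percent) for ten subjects.
# These numbers are placeholders for illustration, NOT results from the paper.
baseline_acc = [60.0, 62.5, 58.0, 65.0, 61.0, 59.5, 63.0, 64.0, 57.5, 60.5]
method_acc = [62.0, 64.0, 61.0, 65.5, 63.5, 60.5, 65.0, 65.5, 61.0, 62.5]

p_value = paired_sign_flip_p_value(baseline_acc, method_acc)
print(p_value)
```

Because every placeholder subject improves, only the all-positive and all-negative sign patterns reach the observed mean gain, giving the smallest p-value the test can produce with ten subjects. Real per-subject results with mixed gains would yield a larger value.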

Circularity Check

0 steps flagged

No circularity detected in derivation or claims


The paper presents an empirical framework (SIMON) that integrates pretrained saliency prediction and foreground segmentation to generate multi-view foveated inputs for EEG-to-image retrieval, then reports measured Top-1 accuracies on the public THINGS-EEG dataset. No first-principles derivation, uniqueness theorem, or predictive equation is claimed; performance numbers are obtained by running the pipeline on held-out data rather than by fitting parameters whose outputs are then re-labeled as predictions. Any self-citations are incidental and do not substitute for the experimental results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities detailed beyond standard ML components and the new SAS module.



Reference graph

Works this paper leans on

28 extracted references · 4 canonical work pages

  [1] Arash Akbarinia. The footprint of colour in EEG signal. bioRxiv, 2025. doi: 10.1101/2025.10.06.680651. This version posted December 4, 2025.

  [2] Marisa Carrasco. Visual attention: The past 25 years. Vision Research, 51(13):1484–1525, 2011.

  [3] Patrick Cavanagh, Gideon P. Caplovitz, Taissa K. Lytchenko, Marvin R. Maechler, Peter U. Tse, and David L. Sheinberg. The architecture of object-based attention. Psychonomic Bulletin & Review, 30(5):1643–1667, 2023.

  [4] Hongzhou Chen, Lianghua He, Yihang Liu, and Longzhen Yang. Visual neural decoding via improved visual-EEG semantic consistency. arXiv preprint arXiv:2408.06788, 2024.

  [5] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2829, 2023.

  [6] Aleato T Gifford, Kshitij Dwivedi, Gemma Roig, and Radoslaw M Cichy. A large and rich EEG dataset for modeling human visual object recognition. NeuroImage, 264:119754, 2022.

  [7] Ziad M Hafed and James J Clark. Microsaccades as an overt measure of covert attention shifts. Vision Research, 42(22):2533–2545, 2002.

  [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

  [9] Alireza Hosseini, Amirhossein Kazerouni, Saeed Akhavan, Michael Brudno, and Babak Taati. SUM: Saliency unification through mamba for visual attention modeling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.

  [10] Sihyeon Jo, Wonsik Jeong, Dong-Won Heo, Yoosung Hwang, and Heung-Il Suk. Hyfi: Hyperbolic feature interpolation for brain-vision alignment. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026.

  [11] Narges Khaleghi, Tohid Yousefi Rezaii, Soosan Beheshti, and Mohammad Reza Daliri. Visual saliency and image reconstruction from EEG signals via an effective geometric deep network-based generative adversarial network. Electronics, 11(21):3637, 2022.

  [12] David A. Klindt, Alexander S. Ecker, Thomas Euler, and Matthias Bethge. Neural system identification for large populations separating “what” and “where”. In Advances in Neural Information Processing Systems (NeurIPS), pages 3506–3516, 2017.

  [13] Vernon J Lawhern, Amelia J Solon, Nicholas R Waytowich, Stephen M Gordon, Chou P Hung, and Brent J Lance. EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces. Journal of Neural Engineering, 15(5):056013, 2018.

  [14] Dongfang Li, Caixia Wei, Shichao Li, Jiachen Zou, Hu Qin, and Qunsheng Liu. Visual decoding and reconstruction via EEG embeddings with guided diffusion. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

  [15] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.

  [16] Susana Martinez-Conde, Stephen L Macknik, and David H Hubel. The role of fixational eye movements in visual perception. Nature Reviews Neuroscience, 5(3):229–240, 2004.

  [17] David Noton and Lawrence Stark. Scanpaths in eye movements during pattern perception. Science, 171(3968):308–311, 1971.

  [18] Thomas P. O’Connell and Marvin M. Chun. Predicting eye movement patterns from fmri responses to natural scenes. Nature Communications, 9:5159, 2018.

  [19] Simone Palazzo, Concetto Spampinato, Isaak Kavasidis, Daniela Giordano, Joseph Schmidt, and Mubarak Shah. Decoding brain representations by multimodal learning of neural activity and visual features. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43(11):3833–3849, 2021.

  [20] Michael I Posner. Orienting of attention. Quarterly Journal of Experimental Psychology, 32(1):3–25, 1980.

  [21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763, 2021.

  [22] Ronald A Rensink. The dynamic representation of scenes. Visual Cognition, 7(1-3):17–42, 2000.

  [23] Edmund T. Rolls. Two what, two where, visual cortical streams in humans. Neuroscience & Biobehavioral Reviews, 160:105650, 2024.

  [24] Robin Tibor Schirrmeister, Jost Tobias Springenberg, Lukas Dominique Josef Fiederer, Martin Glasstetter, Katharina Eggensperger, Michael Tangermann, Frank Hutter, Tonio Ball, and Wolfram Burgard. Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping, 38(11):5391–5420, 2017.

  [25] Yizhe Song, Bingbei Liu, Xuelin Li, Nan Shi, Yijie Wang, and Xiaoguang Gao. Decoding natural images from EEG for object recognition. In International Conference on Learning Representations (ICLR), 2024.

  [26] Haitao Wu, Qing Li, Changqing Zhang, Zhen He, and Xiaomin Ying. Bridging the vision-brain gap with an uncertainty-aware blur prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  [27] Wenjiang Zhang, Sifeng Wang, Yuwei Su, Xinyu Li, Chen Zhang, and Suyu Zhong. Neurobridge: Bio-inspired self-supervised eeg-to-image decoding via cognitive priors and bidirectional semantic alignment. arXiv preprint arXiv:2511.06836, 2025.

  [28] Peng Zheng, Dehong Gao, Deng-Ping Fan, Li Liu, Debra Laefer, and Ming-Ming Cheng. Bilateral reference for high-resolution dichotomous image segmentation. arXiv preprint arXiv:2401.03407, 2024.

A Experiment Configuration

A.1 Datasets and Preprocessing

Dataset. We evaluate our method on the THINGS-EEG dataset [6], a large-scale benchmark collected using a ...