Multi-Level Bidirectional Biomimetic Learning for EEG-Based Visual Decoding

Chuhang Zheng; Jingtao Liu; Peiliang Gong; Qi Zhu; Yiheng Liu

arXiv: 2605.04680 · v1 · submitted 2026-05-06 · cs.CV · cs.AI

Pith reviewed 2026-05-08 17:41 UTC · model grok-4.3

keywords EEG visual decoding · biomimetic learning · contrastive learning · zero-shot retrieval · retinotopic priors · cross-modal alignment

The pith

A biomimetic framework aligns EEG brain signals with images to enable zero-shot retrieval at 80.5 percent top-1 accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that structured physiological inductive biases can overcome the mismatch between high-fidelity digital images and biological visual perception distorted by retinotopic mapping and subject-specific neuroanatomy. It introduces Adaptive Blur with Visual Priors to reweight inputs according to retinotopic information and Biomimetic Visual Feature Extraction to produce multi-level representations matching hierarchical cortical processing. These components are trained together through Multi-level Bidirectional Contrastive Learning to place EEG and visual features in a shared semantic space. The resulting system reaches 80.5 percent top-1 and 97.6 percent top-5 accuracy on zero-shot EEG-to-image retrieval while generalizing across subjects and settings.

Core claim

MB2L achieves 80.5 percent Top-1 and 97.6 percent Top-5 accuracy on zero-shot EEG-to-image retrieval by jointly optimizing Adaptive Blur with Visual Priors to mitigate perceptual-structural mismatch, Biomimetic Visual Feature Extraction to learn multi-level visual representations consistent with hierarchical cortical processing, and Multi-level Bidirectional Contrastive Learning to align EEG and visual features in a shared semantic space.
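
To make the headline metrics concrete, here is a minimal sketch of how Top-1 and Top-k retrieval accuracy are commonly computed from an EEG-to-image similarity matrix. The function name `top_k_accuracy`, the convention that the correct image for query `i` sits at index `i`, and the toy matrix are all illustrative assumptions; the paper's exact evaluation protocol is not described in the text above.

```python
def top_k_accuracy(similarity, k):
    """Fraction of queries whose true match ranks within the top k.

    similarity[i][j] is the score between EEG query i and candidate image j.
    Assumption: the correct image for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring candidates for this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in ranked:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers, for illustration only).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct image ranked first
    [0.2, 0.4, 0.8],  # query 1: correct image ranked second
    [0.1, 0.2, 0.7],  # query 2: correct image ranked first
]
top1 = top_k_accuracy(sim, 1)
top2 = top_k_accuracy(sim, 2)
```

In a zero-shot setting the candidate images come from classes unseen during training, but the ranking computation itself is unchanged.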

What carries the argument

Multi-level Bidirectional Contrastive Learning, which aligns EEG features with multi-level visual representations produced after Adaptive Blur with Visual Priors and Biomimetic Visual Feature Extraction.
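
The idea of a multi-level bidirectional contrastive objective can be sketched as a symmetric InfoNCE loss summed over feature levels. This is a hedged illustration, not the paper's loss: the cosine similarity, the temperature of 0.07, the equal weighting of levels, and every function name below are assumptions, since the text above does not give the formula.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _info_nce(queries, keys, temperature):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    q = [_normalize(v) for v in queries]
    k = [_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def bidirectional_loss(eeg, img, temperature=0.07):
    """Average of the EEG-to-image and image-to-EEG InfoNCE terms."""
    return 0.5 * (_info_nce(eeg, img, temperature) + _info_nce(img, eeg, temperature))


def multi_level_loss(eeg_levels, img_levels, temperature=0.07):
    """Sum the bidirectional loss over feature levels (equal weights assumed)."""
    return sum(
        bidirectional_loss(e, v, temperature)
        for e, v in zip(eeg_levels, img_levels)
    )


# Toy batch of 2 pairs at 2 levels (low-level, high-level); values are made up.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = multi_level_loss([aligned, aligned], [aligned, aligned])
loss_swapped = multi_level_loss([aligned, aligned], [swapped, swapped])
```

Correctly paired features give a loss near zero, while mismatched pairs are penalized heavily, which is the pressure that pulls matching EEG and image embeddings together at each level.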

If this is right

EEG-to-image retrieval becomes reliable enough for practical zero-shot applications across different people and recording conditions.
Subject-invariant visual encoding improves because the model learns representations consistent with shared cortical hierarchy rather than individual anatomy.
Limited paired EEG-image data can still support strong alignment when physiological priors are injected into the visual branch.
Bidirectional contrastive objectives at multiple levels enforce semantic consistency that single-level alignment cannot achieve.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same retinotopic reweighting and hierarchical extraction steps could be tested on other noninvasive signals such as MEG or fMRI for visual decoding.
If the multi-level alignment holds, the framework might support real-time brain-computer interfaces that reconstruct perceived images without per-user recalibration.
Extending the contrastive objectives to include additional modalities like text descriptions of the images could further tighten the shared semantic space.

Load-bearing premise

The assumption that Adaptive Blur with Visual Priors and Biomimetic Visual Feature Extraction, when optimized together via multi-level bidirectional contrastive learning, will sufficiently reduce the fundamental mismatch between digital images and subject-specific biological visual perception.

What would settle it

An ablation study on a held-out subject group in which removing the Adaptive Blur with Visual Priors module causes zero-shot top-1 retrieval accuracy to fall below the best prior method without biomimetic components.

Figures

Figures reproduced from arXiv: 2605.04680 by Chuhang Zheng, Jingtao Liu, Peiliang Gong, Qi Zhu, Yiheng Liu.

**Figure 1.** Schematic of visual processing and neural responses. Left: topographic mapping of visual stimuli in the …

**Figure 2.** Overall framework of MB2L. (1) Adaptive Blur with Visual Priors (top-left): The original image undergoes biomimetic blurring, then hierarchical visual features are extracted via low- and high-level encoders; (2) Biomimetic Visual Feature Extraction (bottom-left): EEG signals are split by a channel-weighted layer, then encoded into hierarchical features through cross-attention; (3) Multi-level Bidirectional …

**Figure 3.** Visualization of representative images pro…

**Figure 5.** EEG channel attention heatmaps across subjects for low- and high-level features.

**Figure 6.** Visualization of adaptive blur function with …

**Figure 8.** Top-1 accuracy (%) of MB2L across various brain and vision encoder combinations on THINGS-EEG. To verify the generalizability of our framework, we conducted comprehensive experiments by training over one thousand models with four EEG encoders and five image encoders spanning diverse architectural variants, including both lightweight and deep models. Across all settings, our framework consistently outp…

**Figure 9.** Effect of Incorporating ABVP across Different Image Processing Methods

**Figure 10.** Visualization of Retinal Topology Fitting Function

**Figure 11.** Similarity matrices for all subjects except Subject 8

**Figure 12.** Good Cases: Top-5 Retrieval Results for Various Stimuli

**Figure 13.** Bad Cases: Top-5 Retrieval Results for Various Stimuli

Abstract

EEG-based visual neural decoding aims to align neural responses with visual stimuli for tasks such as image retrieval. However, limited paired data and a fundamental mismatch between high-fidelity digital images and biological visual perception - distorted by retinotopic mapping and subject-specific neuroanatomy - severely impede cross-modal alignment. To address this, we propose MB2L, a Multi-Level Bidirectional Biomimetic Learning framework that incorporates structured physiological inductive biases into representation learning. Specifically, we propose Adaptive Blur with Visual Priors to mitigate perceptual-structural mismatch by reweighting visual inputs according to retinotopic priors. We further propose Biomimetic Visual Feature Extraction to learn multi-level visual representations consistent with hierarchical cortical processing, enhancing subject-invariant encoding. These modules are jointly optimized via Multi-level Bidirectional Contrastive Learning, which aligns EEG and visual features in a shared semantic space through bidirectional contrastive objectives. Experiments show MB2L achieves 80.5% Top-1 and 97.6% Top-5 accuracy on zero-shot EEG-to-image retrieval, significantly outperforming prior methods and demonstrating strong generalization across subjects and experimental settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MB2L adds retinotopic blur and cortical hierarchy to bidirectional contrastive learning for EEG-image retrieval and reports 80.5% top-1 zero-shot accuracy, but the abstract gives almost no experimental controls to show the priors are doing real work.

The main thing to know is that this paper claims a clear performance jump on zero-shot EEG-to-image retrieval by folding in two physiological priors: adaptive blurring of images according to retinotopic maps, and multi-level feature extraction meant to match cortical stages. Those pieces are then trained with a bidirectional contrastive loss. The numbers (80.5% top-1, 97.6% top-5) are high enough to matter if they survive scrutiny, and the framing around data scarcity plus the mismatch between digital images and biological perception is straightforward.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes MB2L, a Multi-Level Bidirectional Biomimetic Learning framework for EEG-based visual decoding. It introduces Adaptive Blur with Visual Priors to reweight inputs according to retinotopic priors, Biomimetic Visual Feature Extraction for multi-level cortical-consistent representations, and joint optimization via Multi-level Bidirectional Contrastive Learning. The central claim is that this yields 80.5% Top-1 and 97.6% Top-5 accuracy on zero-shot EEG-to-image retrieval, significantly outperforming prior methods with strong generalization across subjects and settings.

Significance. If the performance claims and module contributions are rigorously validated, the work would advance EEG-to-image retrieval by embedding physiological priors into cross-modal alignment, with potential implications for brain-computer interfaces. The reported accuracies are high enough to suggest practical utility, but only if ablations and diagnostics confirm that the biomimetic components drive gains beyond standard contrastive learning on the dataset statistics.

major comments (2)

[Introduction/Methods] The claim that Adaptive Blur with Visual Priors and Biomimetic Visual Feature Extraction close the 'fundamental mismatch' between high-fidelity images and retinotopically distorted biological perception is load-bearing for the generalization and biomimetic framing. The manuscript provides no intermediate diagnostics (e.g., correlation of blurred features with V1/V2 EEG patterns, subject-specific retinotopic alignment error, or ablation isolating the priors from generic blur/pooling). Without these, it is unclear whether the modules contribute beyond data augmentation, weakening the central inductive-bias argument.
[Experiments/Results] The abstract states clear outperformance (80.5% Top-1, 97.6% Top-5) and cross-subject generalization, but the provided text lacks details on baseline implementations, statistical tests (e.g., p-values, confidence intervals), dataset sizes, ablation studies, or controls for the contrastive objective alone. This makes it impossible to verify that the physiological modules, rather than the bidirectional loss on raw statistics, produce the gains.

minor comments (1)

[Methods] Notation for the multi-level contrastive loss and the exact form of the Adaptive Blur reweighting should be formalized with equations in the Methods section for reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and have made revisions to strengthen the manuscript where the concerns are valid.

Referee: [Introduction/Methods] The claim that Adaptive Blur with Visual Priors and Biomimetic Visual Feature Extraction close the 'fundamental mismatch' between high-fidelity images and retinotopically distorted biological perception is load-bearing for the generalization and biomimetic framing. The manuscript provides no intermediate diagnostics (e.g., correlation of blurred features with V1/V2 EEG patterns, subject-specific retinotopic alignment error, or ablation isolating the priors from generic blur/pooling). Without these, it is unclear whether the modules contribute beyond data augmentation, weakening the central inductive-bias argument.

Authors: We agree that stronger intermediate diagnostics would better support the biomimetic framing. In the revised manuscript we have added an ablation isolating Adaptive Blur with Visual Priors from generic blur and no-blur baselines, showing consistent gains attributable to the retinotopic reweighting. We also include feature visualization and subject-specific performance breakdowns that demonstrate improved alignment with expected perceptual distortions. Direct correlation with V1/V2 EEG patterns is not feasible with the current dataset and recording montage, which lacks the spatial resolution for precise cortical localization; we have therefore expanded the discussion to clarify the design rationale drawn from established retinotopic and hierarchical models while acknowledging this limitation.
revision: yes

Referee: [Experiments/Results] The abstract states clear outperformance (80.5% Top-1, 97.6% Top-5) and cross-subject generalization, but the provided text lacks details on baseline implementations, statistical tests (e.g., p-values, confidence intervals), dataset sizes, ablation studies, or controls for the contrastive objective alone. This makes it impossible to verify that the physiological modules, rather than the bidirectional loss on raw statistics, produce the gains.

Authors: We thank the referee for highlighting these omissions. The full manuscript already specifies the THINGS-EEG dataset sizes (10 subjects, trial counts per condition), re-implements baselines following their original papers, and reports ablation studies on the biomimetic modules. To directly address the concern, we have added p-values and 95% confidence intervals for the main retrieval metrics, plus a control ablation that applies only the bidirectional contrastive loss without the Adaptive Blur or Biomimetic Feature Extraction modules. This control shows a clear performance drop, supporting that the physiological components contribute beyond the loss function alone. These details and the new control experiment are now explicitly presented in the Experiments section and supplementary material.
revision: yes
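
The 95% confidence intervals requested in this exchange could be obtained in several ways; one common option is a percentile bootstrap over per-trial hit indicators. The sketch below is an assumption about procedure, not the authors' method: `bootstrap_ci`, the resample count, and the outcome vector are hypothetical, and the vector is made up so that its mean equals 0.805.

```python
import random


def bootstrap_ci(hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy.

    hits is a list of 0/1 indicators, one per test trial (1 = correct retrieval).
    """
    rng = random.Random(seed)
    n = len(hits)
    # Resample trials with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical outcome vector: 161 correct out of 200 trials (mean 0.805).
hits = [1] * 161 + [0] * 39
accuracy = sum(hits) / len(hits)
low, high = bootstrap_ci(hits)
```

Reporting the interval per subject as well as pooled would show whether a high average hides weak individual subjects.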

Circularity Check

0 steps flagged

No circularity: proposed modules and objective are independent inductive biases

The paper introduces Adaptive Blur with Visual Priors, Biomimetic Visual Feature Extraction, and Multi-level Bidirectional Contrastive Learning as new components to incorporate physiological priors and align modalities. These are jointly optimized on data to produce empirical retrieval accuracies; no equations, self-definitions, or self-citations reduce any claimed result to its own inputs by construction. The central performance numbers are experimental outcomes, not predictions forced by the framework's own definitions or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that physiological priors can be effectively modeled and injected into ML pipelines to bridge biological and digital visual representations; no free parameters or invented entities are explicitly detailed in the abstract.

axioms (1)

domain assumption: Structured physiological inductive biases (retinotopic mapping and hierarchical cortical processing) can be incorporated into representation learning to mitigate perceptual-structural mismatch between EEG and images.
Invoked in the description of the two proposed modules as the core mechanism for improving alignment.

pith-pipeline@v0.9.0 · 5499 in / 1307 out tokens · 53615 ms · 2026-05-08T17:41:16.588456+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation (J = ½(x+x⁻¹)−1) · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

w(r) = σ_act(k(r − r_0)) where σ_act represents the activation function, k is the blur coefficient, r_0 represents the radius of the fovea ... k and r_0 are learnable parameters
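
The quoted weighting function can be evaluated directly. The sketch below assumes that σ_act is the logistic sigmoid and that r is the pixel distance from the image center; the passage fixes neither, and the values of k and r_0 here are arbitrary illustrations (in the paper they are learnable). How the resulting weights are applied to blur the image is not specified in the passage, so the sketch stops at the weight map.

```python
import math


def blur_weight(r, k, r0):
    """w(r) = sigmoid(k * (r - r0)); the logistic sigmoid is an assumed sigma_act."""
    return 1.0 / (1.0 + math.exp(-k * (r - r0)))


def weight_map(height, width, k, r0):
    """Evaluate w(r) at every pixel, with r the distance to the image center."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    return [
        [blur_weight(math.hypot(y - cy, x - cx), k, r0) for x in range(width)]
        for y in range(height)
    ]


# Illustrative values only; in the paper k and r0 are learnable parameters.
wmap = weight_map(5, 5, k=2.0, r0=1.5)
center, corner = wmap[2][2], wmap[0][0]
```

With positive k the weight is below 0.5 inside the radius r_0 and rises toward 1 in the periphery, consistent with a sharp fovea and a blurrier surround.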

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 1 internal anchor

[1] F. Hutmacher, “Why is there so much more research on vision than on any other sensory modality?” Frontiers in psychology, vol. 10, p. 481030, 2019.
[2] L. Jiayu, Z. Qiuzhu, L. Wenjuan, Z. Junjun, J. Zhenlan, and L. Ling, “Neural structural underlying audiovisual working memory and visual dominance under cognitive load,” Scientific Reports, vol. 15, no. 1, p. 32778, 2025.
[3] X. Zheng and W. Chen, “An attention-based bi-lstm method for visual object classification via eeg,” Biomedical Signal Processing and Control, vol. 63, p. 102174, 2021.
[4] P. Singh, P. Pandey, K. Miyapuram, and S. Raman, “Eeg2image: image reconstruction from eeg brain signals,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[5] Y. Song, B. Liu, X. Li, N. Shi, Y. Wang, and X. Gao, “Decoding natural images from eeg for object recognition,” arXiv preprint arXiv:2308.13234, 2023.
[6] Y. Benchetrit, H. Banville, and J.-R. King, “Brain decoding: toward real-time reconstruction of visual perception,” arXiv preprint arXiv:2310.19812, 2023.
[7] R. Liu, J. Wei, S. S. Gu, T.-Y. Wu, S. Vosoughi, C. Cui, D. Zhou, and A. M. Dai, “Mind’s eye: Grounded language model reasoning through simulation,” arXiv preprint arXiv:2210.05359, 2022.
[8] T. Fang, Q. Zheng, and G. Pan, “Alleviating the semantic gap for generalized fmri-to-image reconstruction,” Advances in Neural Information Processing Systems, vol. 36, pp. 15096–15107, 2023.
[9] F. L. Ribeiro, N. C. Benson, and A. M. Puckett, “Human retinotopic mapping: From empirical to computational models of retinotopy,” Journal of Vision, vol. 25, no. 8, pp. 14–14, 2025.
[10] F. Csikor, B. Meszéna, K. Ócsai, and G. Orbán, “Top-down perceptual inference shaping the activity of early visual cortex,” Nature Communications, vol. 16, no. 1, p. 9998, 2025.
[11] Y. Miyawaki, H. Uchida, O. Yamashita, M.-a. Sato, Y. Morito, H. C. Tanabe, N. Sadato, and Y. Kamitani, “Visual image reconstruction from human brain activity using a combination of multiscale local image decoders,” Neuron, vol. 60, no. 5, pp. 915–929, 2008.
[12] C. Spampinato, S. Palazzo, I. Kavasidis, D. Giordano, N. Souly, and M. Shah, “Deep learning human mind for automated visual classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6809–6817.
[13] P. S. Scotti, M. Tripathy, C. K. T. Villanueva, R. Kneeland, T. Chen, A. Narang, C. Santhirasegaran, J. Xu, T. Naselaris, K. A. Norman et al., “Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data,” arXiv preprint arXiv:2403.11207, 2024.
[14] H. Fu, H. Wang, J. J. Chin, and Z. Shen, “Brainvis: Exploring the bridge between brain and visual signals via image reconstruction,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
[15] H. Chen, L. He, Y. Liu, and L. Yang, “Visual neural decoding via improved visual-eeg semantic consistency,” arXiv preprint arXiv:2408.06788, 2024.
[16] C. Du, K. Fu, J. Li, and H. He, “Decoding visual neural representations by multimodal learning of brain-visual-linguistic features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10760–10777, 2023.
[17] H. Wu, Q. Li, C. Zhang, Z. He, and X. Ying, “Bridging the vision-brain gap with an uncertainty-aware blur prior,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2246–2257.
[18] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Ed...
[19] M. Liu, D. Guan, C. Zheng, C. Tian, J. Wen, and Q. Zhu, “Vieeg: Hierarchical visual neural representation for eeg brain decoding,” arXiv preprint arXiv:2505.12408, 2025.
[20] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” Advances in neural information processing systems, vol. 35, pp. 25278–25294, 2022.
[21] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International conference on machine learning. PMLR, 2022, pp. 12888–12900.
[22] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR...
[23] M. Bain, A. Nagrani, G. Varol, and A. Zisserman, “Frozen in time: A joint video and image encoder for end-to-end retrieval,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1728–1738.
[24] L. Xue, M. Gao, C. Xing, R. Martín-Martín, J. Wu, C. Xiong, R. Xu, J. C. Niebles, and S. Savarese, “Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 1179–1189.
[25] R. Moreno-Bote, D. C. Knill, and A. Pouget, “Bayesian sampling in visual perception,” Proceedings of the National Academy of Sciences, vol. 108, no. 30, pp. 12491–12496, 2011.
[26] J. S. Bowers, G. Malhotra, M. Dujmović, M. L. Montero, C. Tsvetkov, V. Biscione, G. Puebla, F. Adolfi, J. E. Hummel, R. F. Heaton et al., “Deep problems with neural network models of human vision,” Behavioral and Brain Sciences, vol. 46, p. e385, 2023.
[27] H. Jang and F. Tong, “Improved modeling of human vision by incorporating robustness to blur in convolutional neural networks,” Nature Communications, vol. 15, no. 1, p. 1989, 2024.
[28] W. Zhang, S. Wang, Y. Su, X. Li, C. Zhang, and S. Zhong, “Neurobridge: Bio-inspired self-supervised eeg-to-image decoding via cognitive priors and bidirectional semantic alignment,” arXiv preprint arXiv:2511.06836, 2025.
[29] A. T. Gifford, K. Dwivedi, G. Roig, and R. M. Cichy, “A large and rich eeg dataset for modeling human visual object recognition,” NeuroImage, vol. 264, p. 119754, 2022.
[30] T. Grootswagers, A. K. Robinson, and T. A. Carlson, “The representational dynamics of visual objects in rapid serial visual processing streams,” NeuroImage, vol. 188, pp. 668–679, 2019.
[31] H. Intraub, “Rapid conceptual identification of sequentially presented pictures.” Journal of Experimental Psychology: Human Perception and Performance, vol. 7, no. 3, p. 604, 1981.
[32] C. Keysers, D.-K. Xiao, P. Földiák, and D. I. Perrett, “The speed of sight,” Journal of cognitive neuroscience, vol. 13, no. 1, pp. 90–101, 2001.
[33] M. N. Hebart, O. Contier, L. Teichmann, A. H. Rockter, C. Y. Zheng, A. Kidder, A. Corriveau, M. Vaziri-Pashkam, and C. I. Baker, “Things-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior,” Elife, vol. 12, p. e82580, 2023.
[34] R. T. Schirrmeister, J. T. Springenberg, L. D. J. Fiederer, M. Glasstetter, K. Eggensperger, M. Tangermann, F. Hutter, W. Burgard, and T. Ball, “Deep learning with convolutional neural networks for eeg decoding and visualization,” Human brain mapping, vol. 38, no. 11, pp. 5391–5420, 2017.
[35] V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance, “Eegnet: a compact convolutional neural network for eeg-based brain–computer interfaces,” Journal of neural engineering, vol. 15, no. 5, p. 056013, 2018.
[36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[37] D. Li, C. Wei, S. Li, J. Zou, H. Qin, and Q. Liu, “Visual decoding and reconstruction via eeg embeddings with guided diffusion,” arXiv preprint arXiv:2403.07721, 2024.
[38] Y. Li, Z. Kang, S. Gong, W. Dong, W. Zeng, H. Yan, W. T. Siok, and N. Wang, “Neural-mcrl: Neural multimodal contrastive representation learning for eeg-based visual decoding,” arXiv preprint arXiv:2412.17337, 2024.
[39] K. Zhang, L. He, X. Jiang, W. Lu, D. Wang, and X. Gao, “Cognitioncapturer: Decoding visual stimuli from human eeg signal with multimodal information,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 13, 2025, pp. 14486–14493.
[40] M. R. Nuwer, G. Comi, R. Emerson, A. Fuglsang-Frederiksen, J.-M. Guérit, H. Hinrichs, A. Ikeda, F. J. C. Luccas, and P. Rappelsburger, “Ifcn standards for digital recording of clinical eeg,” Electroencephalography and clinical Neurophysiology, vol. 106, no. 3, pp. 259–261, 1998.
[41] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.

A Experimental details

A.1 Datasets details

THINGS-EEG [29] is a large-scale EEG dataset involving 10 subjects, collected using the Rapid Serial Visual Presentation (RSVP) paradigm [30, 31, 32]. The EEG data were recorded...

[1] [1]

Why is there so much more research on vision than on any other sensory modality?

F. Hutmacher, “Why is there so much more research on vision than on any other sensory modality?”Frontiers in psychology, vol. 10, p. 481030, 2019

work page 2019

[2] [2]

Neural structural underlying audiovisual working memory and visual dominance under cognitive load,

L. Jiayu, Z. Qiuzhu, L. Wenjuan, Z. Junjun, J. Zhenlan, and L. Ling, “Neural structural underlying audiovisual working memory and visual dominance under cognitive load,”Scientific Reports, vol. 15, no. 1, p. 32778, 2025

work page 2025

[3] [3]

An attention-based bi-lstm method for visual object classification via eeg,

X. Zheng and W. Chen, “An attention-based bi-lstm method for visual object classification via eeg,”Biomedical Signal Processing and Control, vol. 63, p. 102174, 2021

work page 2021

[4] [4]

Eeg2image: image reconstruction from eeg brain signals,

P. Singh, P. Pandey, K. Miyapuram, and S. Raman, “Eeg2image: image reconstruction from eeg brain signals,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5. 9 Running Title for Header

work page 2023

[5] [5]

Decoding natural images from eeg for object recognition

Y . Song, B. Liu, X. Li, N. Shi, Y . Wang, and X. Gao, “Decoding natural images from eeg for object recognition,”arXiv preprint arXiv:2308.13234, 2023

work page arXiv 2023

[6] [6]

Brain decoding: toward real-time reconstruction of visual perception.arXiv preprint arXiv:2310.19812, 2023

Y . Benchetrit, H. Banville, and J.-R. King, “Brain decoding: toward real-time reconstruction of visual perception,”arXiv preprint arXiv:2310.19812, 2023

work page arXiv 2023

[7] R. Liu, J. Wei, S. S. Gu, T.-Y. Wu, S. Vosoughi, C. Cui, D. Zhou, and A. M. Dai, “Mind’s eye: Grounded language model reasoning through simulation,” arXiv preprint arXiv:2210.05359, 2022.

[8] T. Fang, Q. Zheng, and G. Pan, “Alleviating the semantic gap for generalized fmri-to-image reconstruction,” Advances in Neural Information Processing Systems, vol. 36, pp. 15 096–15 107, 2023.

[9] F. L. Ribeiro, N. C. Benson, and A. M. Puckett, “Human retinotopic mapping: From empirical to computational models of retinotopy,” Journal of Vision, vol. 25, no. 8, pp. 14–14, 2025.

[10] F. Csikor, B. Meszéna, K. Ócsai, and G. Orbán, “Top-down perceptual inference shaping the activity of early visual cortex,” Nature Communications, vol. 16, no. 1, p. 9998, 2025.

[11] Y. Miyawaki, H. Uchida, O. Yamashita, M.-a. Sato, Y. Morito, H. C. Tanabe, N. Sadato, and Y. Kamitani, “Visual image reconstruction from human brain activity using a combination of multiscale local image decoders,” Neuron, vol. 60, no. 5, pp. 915–929, 2008.

[12] C. Spampinato, S. Palazzo, I. Kavasidis, D. Giordano, N. Souly, and M. Shah, “Deep learning human mind for automated visual classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6809–6817.

[13] P. S. Scotti, M. Tripathy, C. K. T. Villanueva, R. Kneeland, T. Chen, A. Narang, C. Santhirasegaran, J. Xu, T. Naselaris, K. A. Norman et al., “Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data,” arXiv preprint arXiv:2403.11207, 2024.

[14] H. Fu, H. Wang, J. J. Chin, and Z. Shen, “Brainvis: Exploring the bridge between brain and visual signals via image reconstruction,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.

[15] H. Chen, L. He, Y. Liu, and L. Yang, “Visual neural decoding via improved visual-eeg semantic consistency,” arXiv preprint arXiv:2408.06788, 2024.

[16] C. Du, K. Fu, J. Li, and H. He, “Decoding visual neural representations by multimodal learning of brain-visual-linguistic features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10 760–10 777, 2023.

[17] H. Wu, Q. Li, C. Zhang, Z. He, and X. Ying, “Bridging the vision-brain gap with an uncertainty-aware blur prior,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2246–2257.

[18] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Ed...

work page 2021

[19] M. Liu, D. Guan, C. Zheng, C. Tian, J. Wen, and Q. Zhu, “Vieeg: Hierarchical visual neural representation for eeg brain decoding,” arXiv preprint arXiv:2505.12408, 2025.

[20] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” Advances in neural information processing systems, vol. 35, pp. 25 278–25 294, 2022.

[21] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International conference on machine learning. PMLR, 2022, pp. 12 888–12 900.

[22] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR...

work page 2021

[23] M. Bain, A. Nagrani, G. Varol, and A. Zisserman, “Frozen in time: A joint video and image encoder for end-to-end retrieval,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1728–1738.

[24] L. Xue, M. Gao, C. Xing, R. Martín-Martín, J. Wu, C. Xiong, R. Xu, J. C. Niebles, and S. Savarese, “Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 1179–1189.

[25] R. Moreno-Bote, D. C. Knill, and A. Pouget, “Bayesian sampling in visual perception,” Proceedings of the National Academy of Sciences, vol. 108, no. 30, pp. 12 491–12 496, 2011.

[26] J. S. Bowers, G. Malhotra, M. Dujmović, M. L. Montero, C. Tsvetkov, V. Biscione, G. Puebla, F. Adolfi, J. E. Hummel, R. F. Heaton et al., “Deep problems with neural network models of human vision,” Behavioral and Brain Sciences, vol. 46, p. e385, 2023.

[27] H. Jang and F. Tong, “Improved modeling of human vision by incorporating robustness to blur in convolutional neural networks,” Nature Communications, vol. 15, no. 1, p. 1989, 2024.

[28] W. Zhang, S. Wang, Y. Su, X. Li, C. Zhang, and S. Zhong, “Neurobridge: Bio-inspired self-supervised eeg-to-image decoding via cognitive priors and bidirectional semantic alignment,” arXiv preprint arXiv:2511.06836, 2025.

[29] A. T. Gifford, K. Dwivedi, G. Roig, and R. M. Cichy, “A large and rich eeg dataset for modeling human visual object recognition,” NeuroImage, vol. 264, p. 119754, 2022.

[30] T. Grootswagers, A. K. Robinson, and T. A. Carlson, “The representational dynamics of visual objects in rapid serial visual processing streams,” NeuroImage, vol. 188, pp. 668–679, 2019.

[31] H. Intraub, “Rapid conceptual identification of sequentially presented pictures,” Journal of Experimental Psychology: Human Perception and Performance, vol. 7, no. 3, p. 604, 1981.

[32] C. Keysers, D.-K. Xiao, P. Földiák, and D. I. Perrett, “The speed of sight,” Journal of cognitive neuroscience, vol. 13, no. 1, pp. 90–101, 2001.

[33] M. N. Hebart, O. Contier, L. Teichmann, A. H. Rockter, C. Y. Zheng, A. Kidder, A. Corriveau, M. Vaziri-Pashkam, and C. I. Baker, “Things-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior,” Elife, vol. 12, p. e82580, 2023.

[34] R. T. Schirrmeister, J. T. Springenberg, L. D. J. Fiederer, M. Glasstetter, K. Eggensperger, M. Tangermann, F. Hutter, W. Burgard, and T. Ball, “Deep learning with convolutional neural networks for eeg decoding and visualization,” Human brain mapping, vol. 38, no. 11, pp. 5391–5420, 2017.

[35] V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance, “Eegnet: a compact convolutional neural network for eeg-based brain–computer interfaces,” Journal of neural engineering, vol. 15, no. 5, p. 056013, 2018.

[36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[37] D. Li, C. Wei, S. Li, J. Zou, H. Qin, and Q. Liu, “Visual decoding and reconstruction via eeg embeddings with guided diffusion,” arXiv preprint arXiv:2403.07721, 2024.

[38] Y. Li, Z. Kang, S. Gong, W. Dong, W. Zeng, H. Yan, W. T. Siok, and N. Wang, “Neural-mcrl: Neural multimodal contrastive representation learning for eeg-based visual decoding,” arXiv preprint arXiv:2412.17337, 2024.

[39] K. Zhang, L. He, X. Jiang, W. Lu, D. Wang, and X. Gao, “Cognitioncapturer: Decoding visual stimuli from human eeg signal with multimodal information,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 13, 2025, pp. 14 486–14 493.

[40] M. R. Nuwer, G. Comi, R. Emerson, A. Fuglsang-Frederiksen, J.-M. Guérit, H. Hinrichs, A. Ikeda, F. J. C. Luccas, and P. Rappelsburger, “Ifcn standards for digital recording of clinical eeg,” Electroencephalography and clinical Neurophysiology, vol. 106, no. 3, pp. 259–261, 1998.

[41] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.

A Experimental details

A.1 Datasets details

THINGS-EEG [29] is a large-scale EEG dataset involving 10 subjects, collected using the Rapid Serial Visual Presentation (RSVP) paradigm [30, 31, 32]. The EEG data were recorded...