Multi-Level Bidirectional Biomimetic Learning for EEG-Based Visual Decoding
Pith reviewed 2026-05-08 17:41 UTC · model grok-4.3
The pith
A biomimetic framework aligns EEG brain signals with images to enable zero-shot retrieval at 80.5 percent top-1 accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MB2L achieves 80.5 percent Top-1 and 97.6 percent Top-5 accuracy on zero-shot EEG-to-image retrieval by jointly optimizing Adaptive Blur with Visual Priors to mitigate perceptual-structural mismatch, Biomimetic Visual Feature Extraction to learn multi-level visual representations consistent with hierarchical cortical processing, and Multi-level Bidirectional Contrastive Learning to align EEG and visual features in a shared semantic space.
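For readers unfamiliar with the metric, Top-k retrieval accuracy can be computed roughly as in the sketch below. It assumes a precomputed similarity matrix in which entry [i][j] scores EEG trial i against candidate image j, with the correct image for trial i at index i. The function name `top_k_accuracy` and the toy scores are illustrative and not taken from the paper.

```python
def top_k_accuracy(similarity, k):
    """Fraction of queries whose true match (index i for row i) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (hypothetical values, not from the paper).
toy_similarity = [
    [0.9, 0.1, 0.0],  # query 0: correct image ranked first
    [0.2, 0.3, 0.8],  # query 1: correct image ranked second
    [0.1, 0.2, 0.7],  # query 2: correct image ranked first
]
top1 = top_k_accuracy(toy_similarity, k=1)
top2 = top_k_accuracy(toy_similarity, k=2)
```

In a zero-shot setting the candidate images belong to classes not seen during training, so the same ranking is applied to held-out test concepts.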
What carries the argument
Multi-level Bidirectional Contrastive Learning, which aligns EEG features with multi-level visual representations produced after Adaptive Blur with Visual Priors and Biomimetic Visual Feature Extraction.
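The reviewed text does not give the exact loss, so the following is only a generic sketch of a bidirectional (symmetric InfoNCE, CLIP-style) contrastive objective summed over several visual feature levels. The temperature value, the per-level weights, and all function names are assumptions made for illustration, not the authors' formulation.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def bidirectional_contrastive_loss(eeg, img, temperature=0.07):
    """Symmetric InfoNCE over one batch of paired EEG and image embeddings."""
    eeg = [_normalize(v) for v in eeg]
    img = [_normalize(v) for v in img]
    # logits[i][j]: temperature-scaled cosine similarity of EEG i and image j.
    logits = [[sum(a * b for a, b in zip(e, v)) / temperature for v in img] for e in eeg]
    logits_t = [list(col) for col in zip(*logits)]
    eeg_to_img = _cross_entropy_diagonal(logits)
    img_to_eeg = _cross_entropy_diagonal(logits_t)
    return 0.5 * (eeg_to_img + img_to_eeg)


def multi_level_loss(eeg, img_levels, weights):
    """Weighted sum of the bidirectional loss across visual feature levels."""
    return sum(w * bidirectional_contrastive_loss(eeg, img) for w, img in zip(weights, img_levels))


# Toy check: correctly paired embeddings should score a lower loss than shuffled ones.
eeg_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = bidirectional_contrastive_loss(eeg_batch, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = bidirectional_contrastive_loss(eeg_batch, [[0.0, 1.0], [1.0, 0.0]])
two_level = multi_level_loss(eeg_batch, [eeg_batch, eeg_batch], weights=[0.5, 0.5])
```

Averaging the EEG-to-image and image-to-EEG directions is what makes the objective bidirectional; a single-direction variant would keep only one of the two cross-entropy terms.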
If this is right
- EEG-to-image retrieval becomes reliable enough for practical zero-shot applications across different people and recording conditions.
- Subject-invariant visual encoding improves because the model learns representations consistent with shared cortical hierarchy rather than individual anatomy.
- Limited paired EEG-image data can still support strong alignment when physiological priors are injected into the visual branch.
- Bidirectional contrastive objectives at multiple levels enforce semantic consistency that single-level alignment cannot achieve.
Where Pith is reading between the lines
- The same retinotopic reweighting and hierarchical extraction steps could be tested on other noninvasive signals such as MEG or fMRI for visual decoding.
- If the multi-level alignment holds, the framework might support real-time brain-computer interfaces that reconstruct perceived images without per-user recalibration.
- Extending the contrastive objectives to include additional modalities like text descriptions of the images could further tighten the shared semantic space.
Load-bearing premise
The assumption that Adaptive Blur with Visual Priors and Biomimetic Visual Feature Extraction, when optimized together via multi-level bidirectional contrastive learning, will sufficiently reduce the fundamental mismatch between digital images and subject-specific biological visual perception.
What would settle it
An ablation study on a held-out subject group in which removing the Adaptive Blur with Visual Priors module causes zero-shot top-1 retrieval accuracy to fall below the best prior method without biomimetic components.
Original abstract
EEG-based visual neural decoding aims to align neural responses with visual stimuli for tasks such as image retrieval. However, limited paired data and a fundamental mismatch between high-fidelity digital images and biological visual perception - distorted by retinotopic mapping and subject-specific neuroanatomy - severely impede cross-modal alignment. To address this, we propose MB2L, a Multi-Level Bidirectional Biomimetic Learning framework that incorporates structured physiological inductive biases into representation learning. Specifically, we propose Adaptive Blur with Visual Priors to mitigate perceptual-structural mismatch by reweighting visual inputs according to retinotopic priors. We further propose Biomimetic Visual Feature Extraction to learn multi-level visual representations consistent with hierarchical cortical processing, enhancing subject-invariant encoding. These modules are jointly optimized via Multi-level Bidirectional Contrastive Learning, which aligns EEG and visual features in a shared semantic space through bidirectional contrastive objectives. Experiments show MB2L achieves 80.5% Top-1 and 97.6% Top-5 accuracy on zero-shot EEG-to-image retrieval, significantly outperforming prior methods and demonstrating strong generalization across subjects and experimental settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MB2L, a Multi-Level Bidirectional Biomimetic Learning framework for EEG-based visual decoding. It introduces Adaptive Blur with Visual Priors to reweight inputs according to retinotopic priors, Biomimetic Visual Feature Extraction for multi-level cortical-consistent representations, and joint optimization via Multi-level Bidirectional Contrastive Learning. The central claim is that this yields 80.5% Top-1 and 97.6% Top-5 accuracy on zero-shot EEG-to-image retrieval, significantly outperforming prior methods with strong generalization across subjects and settings.
Significance. If the performance claims and module contributions are rigorously validated, the work would advance EEG-to-image retrieval by embedding physiological priors into cross-modal alignment, with potential implications for brain-computer interfaces. The reported accuracies are high enough to suggest practical utility, but only if ablations and diagnostics confirm that the biomimetic components drive gains beyond standard contrastive learning on the dataset statistics.
major comments (2)
- [Introduction/Methods] Introduction and Methods: The claim that Adaptive Blur with Visual Priors and Biomimetic Visual Feature Extraction close the 'fundamental mismatch' between high-fidelity images and retinotopically distorted biological perception is load-bearing for the generalization and biomimetic framing. The manuscript provides no intermediate diagnostics (e.g., correlation of blurred features with V1/V2 EEG patterns, subject-specific retinotopic alignment error, or ablation isolating the priors from generic blur/pooling). Without these, it is unclear whether the modules contribute beyond data augmentation, weakening the central inductive-bias argument.
- [Experiments/Results] Experiments/Results: The abstract states clear outperformance (80.5% Top-1, 97.6% Top-5) and cross-subject generalization, but the provided text lacks details on baseline implementations, statistical tests (e.g., p-values, confidence intervals), dataset sizes, ablation studies, or controls for the contrastive objective alone. This makes it impossible to verify that the physiological modules, rather than the bidirectional loss on raw statistics, produce the gains.
minor comments (1)
- [Methods] Notation for the multi-level contrastive loss and the exact form of the Adaptive Blur reweighting should be formalized with equations in the Methods section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and have made revisions to strengthen the manuscript where the concerns are valid.
Referee: [Introduction/Methods] Introduction and Methods: The claim that Adaptive Blur with Visual Priors and Biomimetic Visual Feature Extraction close the 'fundamental mismatch' between high-fidelity images and retinotopically distorted biological perception is load-bearing for the generalization and biomimetic framing. The manuscript provides no intermediate diagnostics (e.g., correlation of blurred features with V1/V2 EEG patterns, subject-specific retinotopic alignment error, or ablation isolating the priors from generic blur/pooling). Without these, it is unclear whether the modules contribute beyond data augmentation, weakening the central inductive-bias argument.
Authors: We agree that stronger intermediate diagnostics would better support the biomimetic framing. In the revised manuscript we have added an ablation isolating Adaptive Blur with Visual Priors from generic blur and no-blur baselines, showing consistent gains attributable to the retinotopic reweighting. We also include feature visualization and subject-specific performance breakdowns that demonstrate improved alignment with expected perceptual distortions. Direct correlation with V1/V2 EEG patterns is not feasible with the current dataset and recording montage, which lacks the spatial resolution for precise cortical localization; we have therefore expanded the discussion to clarify the design rationale drawn from established retinotopic and hierarchical models while acknowledging this limitation. revision: yes
Referee: [Experiments/Results] Experiments/Results: The abstract states clear outperformance (80.5% Top-1, 97.6% Top-5) and cross-subject generalization, but the provided text lacks details on baseline implementations, statistical tests (e.g., p-values, confidence intervals), dataset sizes, ablation studies, or controls for the contrastive objective alone. This makes it impossible to verify that the physiological modules, rather than the bidirectional loss on raw statistics, produce the gains.
Authors: We thank the referee for highlighting these omissions. The full manuscript already specifies the THINGS-EEG dataset sizes (10 subjects, trial counts per condition), re-implements baselines following their original papers, and reports ablation studies on the biomimetic modules. To directly address the concern, we have added p-values and 95% confidence intervals for the main retrieval metrics, plus a control ablation that applies only the bidirectional contrastive loss without the Adaptive Blur or Biomimetic Feature Extraction modules. This control shows a clear performance drop, supporting that the physiological components contribute beyond the loss function alone. These details and the new control experiment are now explicitly presented in the Experiments section and supplementary material. revision: yes
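The response does not say how the 95% confidence intervals were computed. One common option is a percentile bootstrap over per-subject accuracies, sketched below under that assumption. The function name `bootstrap_ci`, the resample count, and the input values are placeholders for illustration and are not results from the paper.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample subjects with replacement and record the resampled mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder per-subject accuracies (toy numbers, not taken from the paper).
toy_accuracies = [0.60, 0.70, 0.65, 0.72, 0.68]
low, high = bootstrap_ci(toy_accuracies)
```

With only 10 subjects, as in THINGS-EEG, such intervals are coarse, which is one reason a paired test against each baseline is a useful complement.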
Circularity Check
No circularity: proposed modules and objective are independent inductive biases
The paper introduces Adaptive Blur with Visual Priors, Biomimetic Visual Feature Extraction, and Multi-level Bidirectional Contrastive Learning as new components to incorporate physiological priors and align modalities. These are jointly optimized on data to produce empirical retrieval accuracies; no equations, self-definitions, or self-citations reduce any claimed result to its own inputs by construction. The central performance numbers are experimental outcomes, not predictions forced by the framework's own definitions or prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Structured physiological inductive biases (retinotopic mapping and hierarchical cortical processing) can be incorporated into representation learning to mitigate perceptual-structural mismatch between EEG and images.
Lean theorems connected to this paper
- Cost.FunctionalEquation (J = ½(x + x⁻¹) − 1), washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
w(r) = σ_act(k(r − r_0)) where σ_act represents the activation function, k is the blur coefficient, r_0 represents the radius of the fovea ... k and r_0 are learnable parameters
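The quoted weighting can be illustrated with a small sketch. It assumes that σ_act is a logistic sigmoid, that r is the distance from the image centre normalised so the farthest corner has r = 1, and that the weight blends a sharp and a blurred copy of the image. None of these choices is confirmed by the passage, and k and r_0 are fixed constants here, whereas the paper describes them as learnable.

```python
import math


def blur_weight(r, k, r0):
    """w(r) = sigmoid(k * (r - r0)); the sigmoid is an assumed choice for sigma_act."""
    return 1.0 / (1.0 + math.exp(-k * (r - r0)))


def radial_weight_map(height, width, k, r0):
    """Per-pixel blur weight from normalised distance to the image centre."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    r_max = math.hypot(cy, cx) or 1.0  # avoid division by zero for a 1x1 image
    return [
        [blur_weight(math.hypot(y - cy, x - cx) / r_max, k, r0) for x in range(width)]
        for y in range(height)
    ]


def blend(sharp, blurred, weights):
    """Mix sharp and blurred pixels; a low weight keeps the sharp value."""
    return [
        [(1.0 - w) * s + w * b for s, b, w in zip(s_row, b_row, w_row)]
        for s_row, b_row, w_row in zip(sharp, blurred, weights)
    ]


# Hypothetical settings: fovea radius at half the maximum eccentricity.
weights = radial_weight_map(5, 5, k=10.0, r0=0.5)
centre_weight = weights[2][2]  # close to 0: the centre stays sharp
corner_weight = weights[0][0]  # close to 1: the periphery is blurred
```

Under these assumptions, r_0 sets where the transition from sharp to blurred occurs and k sets how abrupt it is, which matches the passage's description of r_0 as the fovea radius and k as the blur coefficient.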
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] F. Hutmacher, “Why is there so much more research on vision than on any other sensory modality?” Frontiers in psychology, vol. 10, p. 481030, 2019.
- [2] L. Jiayu, Z. Qiuzhu, L. Wenjuan, Z. Junjun, J. Zhenlan, and L. Ling, “Neural structural underlying audiovisual working memory and visual dominance under cognitive load,” Scientific Reports, vol. 15, no. 1, p. 32778, 2025.
- [3] X. Zheng and W. Chen, “An attention-based bi-lstm method for visual object classification via eeg,” Biomedical Signal Processing and Control, vol. 63, p. 102174, 2021.
- [4] P. Singh, P. Pandey, K. Miyapuram, and S. Raman, “Eeg2image: image reconstruction from eeg brain signals,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [5] Y. Song, B. Liu, X. Li, N. Shi, Y. Wang, and X. Gao, “Decoding natural images from eeg for object recognition,” arXiv preprint arXiv:2308.13234, 2023.
- [6] Y. Benchetrit, H. Banville, and J.-R. King, “Brain decoding: toward real-time reconstruction of visual perception,” arXiv preprint arXiv:2310.19812, 2023.
- [7]
- [8] T. Fang, Q. Zheng, and G. Pan, “Alleviating the semantic gap for generalized fmri-to-image reconstruction,” Advances in Neural Information Processing Systems, vol. 36, pp. 15096–15107, 2023.
- [9] F. L. Ribeiro, N. C. Benson, and A. M. Puckett, “Human retinotopic mapping: From empirical to computational models of retinotopy,” Journal of Vision, vol. 25, no. 8, pp. 14–14, 2025.
- [10] F. Csikor, B. Meszéna, K. Ócsai, and G. Orbán, “Top-down perceptual inference shaping the activity of early visual cortex,” Nature Communications, vol. 16, no. 1, p. 9998, 2025.
- [11] Y. Miyawaki, H. Uchida, O. Yamashita, M.-a. Sato, Y. Morito, H. C. Tanabe, N. Sadato, and Y. Kamitani, “Visual image reconstruction from human brain activity using a combination of multiscale local image decoders,” Neuron, vol. 60, no. 5, pp. 915–929, 2008.
- [12] C. Spampinato, S. Palazzo, I. Kavasidis, D. Giordano, N. Souly, and M. Shah, “Deep learning human mind for automated visual classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6809–6817.
- [13] P. S. Scotti, M. Tripathy, C. K. T. Villanueva, R. Kneeland, T. Chen, A. Narang, C. Santhirasegaran, J. Xu, T. Naselaris, K. A. Norman et al., “Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data,” arXiv preprint arXiv:2403.11207, 2024.
- [14] H. Fu, H. Wang, J. J. Chin, and Z. Shen, “Brainvis: Exploring the bridge between brain and visual signals via image reconstruction,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
- [15] H. Chen, L. He, Y. Liu, and L. Yang, “Visual neural decoding via improved visual-eeg semantic consistency,” arXiv preprint arXiv:2408.06788, 2024.
- [16] C. Du, K. Fu, J. Li, and H. He, “Decoding visual neural representations by multimodal learning of brain-visual-linguistic features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10760–10777, 2023.
- [17] H. Wu, Q. Li, C. Zhang, Z. He, and X. Ying, “Bridging the vision-brain gap with an uncertainty-aware blur prior,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2246–2257.
- [18] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Ed...
- [19] M. Liu, D. Guan, C. Zheng, C. Tian, J. Wen, and Q. Zhu, “Vieeg: Hierarchical visual neural representation for eeg brain decoding,” arXiv preprint arXiv:2505.12408, 2025.
- [20] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” Advances in neural information processing systems, vol. 35, pp. 25278–25294, 2022.
- [21] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International conference on machine learning. PMLR, 2022, pp. 12888–12900.
- [22] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR...
- [23] M. Bain, A. Nagrani, G. Varol, and A. Zisserman, “Frozen in time: A joint video and image encoder for end-to-end retrieval,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1728–1738.
- [24] L. Xue, M. Gao, C. Xing, R. Martín-Martín, J. Wu, C. Xiong, R. Xu, J. C. Niebles, and S. Savarese, “Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 1179–1189.
- [25] R. Moreno-Bote, D. C. Knill, and A. Pouget, “Bayesian sampling in visual perception,” Proceedings of the National Academy of Sciences, vol. 108, no. 30, pp. 12491–12496, 2011.
- [26] J. S. Bowers, G. Malhotra, M. Dujmović, M. L. Montero, C. Tsvetkov, V. Biscione, G. Puebla, F. Adolfi, J. E. Hummel, R. F. Heaton et al., “Deep problems with neural network models of human vision,” Behavioral and Brain Sciences, vol. 46, p. e385, 2023.
- [27] H. Jang and F. Tong, “Improved modeling of human vision by incorporating robustness to blur in convolutional neural networks,” Nature Communications, vol. 15, no. 1, p. 1989, 2024.
- [28] W. Zhang, S. Wang, Y. Su, X. Li, C. Zhang, and S. Zhong, “Neurobridge: Bio-inspired self-supervised eeg-to-image decoding via cognitive priors and bidirectional semantic alignment,” arXiv preprint arXiv:2511.06836, 2025.
- [29] A. T. Gifford, K. Dwivedi, G. Roig, and R. M. Cichy, “A large and rich eeg dataset for modeling human visual object recognition,” NeuroImage, vol. 264, p. 119754, 2022.
- [30] T. Grootswagers, A. K. Robinson, and T. A. Carlson, “The representational dynamics of visual objects in rapid serial visual processing streams,” NeuroImage, vol. 188, pp. 668–679, 2019.
- [31] H. Intraub, “Rapid conceptual identification of sequentially presented pictures,” Journal of Experimental Psychology: Human Perception and Performance, vol. 7, no. 3, p. 604, 1981.
- [32] C. Keysers, D.-K. Xiao, P. Földiák, and D. I. Perrett, “The speed of sight,” Journal of cognitive neuroscience, vol. 13, no. 1, pp. 90–101, 2001.
- [33] M. N. Hebart, O. Contier, L. Teichmann, A. H. Rockter, C. Y. Zheng, A. Kidder, A. Corriveau, M. Vaziri-Pashkam, and C. I. Baker, “Things-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior,” Elife, vol. 12, p. e82580, 2023.
- [34] R. T. Schirrmeister, J. T. Springenberg, L. D. J. Fiederer, M. Glasstetter, K. Eggensperger, M. Tangermann, F. Hutter, W. Burgard, and T. Ball, “Deep learning with convolutional neural networks for eeg decoding and visualization,” Human brain mapping, vol. 38, no. 11, pp. 5391–5420, 2017.
- [35] V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance, “Eegnet: a compact convolutional neural network for eeg-based brain–computer interfaces,” Journal of neural engineering, vol. 15, no. 5, p. 056013, 2018.
- [36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [37] D. Li, C. Wei, S. Li, J. Zou, H. Qin, and Q. Liu, “Visual decoding and reconstruction via eeg embeddings with guided diffusion,” arXiv preprint arXiv:2403.07721, 2024.
- [38] Y. Li, Z. Kang, S. Gong, W. Dong, W. Zeng, H. Yan, W. T. Siok, and N. Wang, “Neural-mcrl: Neural multimodal contrastive representation learning for eeg-based visual decoding,” arXiv preprint arXiv:2412.17337, 2024.
- [39] K. Zhang, L. He, X. Jiang, W. Lu, D. Wang, and X. Gao, “Cognitioncapturer: Decoding visual stimuli from human eeg signal with multimodal information,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 13, 2025, pp. 14486–14493.
- [40] M. R. Nuwer, G. Comi, R. Emerson, A. Fuglsang-Frederiksen, J.-M. Guérit, H. Hinrichs, A. Ikeda, F. J. C. Luccas, and P. Rappelsburger, “Ifcn standards for digital recording of clinical eeg,” Electroencephalography and clinical Neurophysiology, vol. 106, no. 3, pp. 259–261, 1998.
- [41] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
A Experimental details
A.1 Datasets details
THINGS-EEG [29] is a large-scale EEG dataset involving 10 subjects, collected using the Rapid Serial Visual Presentation (RSVP) paradigm [30, 31, 32]. The EEG data were recorded...