arxiv: 2605.23996 · v1 · pith:V2VVUAK3new · submitted 2026-05-18 · 💻 cs.CV · eess.IV

Brain-to-Image Retrieval and Reconstruction via Multimodal EEG Alignment

Pith reviewed 2026-06-30 19:00 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords EEG decoding · image retrieval · image reconstruction · CLIP embeddings · brain-to-image · multimodal alignment · generative modeling · SDXL-Turbo

The pith

EEG signals during natural image viewing can be aligned to multimodal CLIP embeddings for 86% top-1 retrieval accuracy and 0.903 CLIP-score reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that EEG recorded while a subject views natural images carries enough visual detail to support both ranking the correct stimulus among 200 candidates and generating a matching image. It does so by training a retrieval model that combines multi-level blurring with EVNet features under InfoNCE loss and a reconstruction model called CognitionCapturerPro that maps EEG to image, text, depth, and edge CLIP embeddings before conditioning SDXL-Turbo via IP-Adapter. The reported results average 86.30% top-1 and 98.55% top-5 retrieval accuracy and 0.903 CLIP score on reconstruction across ten random seeds for one subject. A reader would care because the work shows a concrete pipeline that turns raw brain signals into both identification and synthesis without extra physiological sensors.

Core claim

We present a brain-to-image system that decodes visual stimuli from EEG signals recorded during natural image viewing. For retrieval we implement a multi-level blurring approach improved with biologically inspired EVNet features and trained with the InfoNCE loss, reaching 86.30% top-1 and 98.55% top-5 accuracy. For reconstruction we implement CognitionCapturerPro, which aligns EEG representations to multi-modal CLIP embeddings including image, text, depth, and edge, and synthesizes images with SDXL-Turbo conditioned via IP-Adapter, reaching 0.903 CLIP score with ViT-H-14.
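
The CLIP score quoted above is read here as a cosine similarity between the CLIP embedding of a reconstructed image and that of the stimulus. This summary does not spell out the paper's exact scoring protocol, so the following is a minimal sketch under that assumption; the function name `cosine_similarity` and the toy 4-dimensional vectors are hypothetical stand-ins for real CLIP embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for CLIP embeddings (hypothetical values).
stimulus = [0.2, 0.1, 0.9, 0.3]
reconstruction = [0.25, 0.05, 0.8, 0.4]
score = cosine_similarity(stimulus, reconstruction)
print(f"CLIP-style score: {score:.3f}")
```

Under this reading, a dataset-level score would be the average of such similarities over all test images, computed separately for each CLIP backbone.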

What carries the argument

CognitionCapturerPro, the alignment network that maps EEG segments to a joint space of image, text, depth, and edge CLIP embeddings so that those embeddings can directly condition a diffusion generator.

If this is right

  • The retrieval model can rank the correct image among 200 candidates at 86.30% top-1 and 98.55% top-5 accuracy.
  • The reconstruction model produces images whose CLIP embeddings match the stimulus at 0.903 with ViT-H-14 and 0.870 with ViT-L/14 while attaining 0.409 SSIM.
  • Both pipelines remain stable when averaged over ten random seeds on the same subject.
  • The system operates on EEG collected during ordinary image viewing without added physiological constraints.
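
The top-1 and top-5 figures in the list above are ranking metrics. As a rough sketch of how such numbers are typically computed (not the authors' evaluation code), the function below counts a query as a hit when fewer than k candidates score strictly higher than the true one. The helper name `top_k_accuracy`, the convention that the correct candidate for query i sits at index i, and the 4-way toy matrix are all assumptions for illustration.

```python
def top_k_accuracy(similarity, k):
    """Fraction of queries whose true candidate is among the k highest scores.

    similarity[i][j] is the score between EEG query i and candidate image j;
    the correct candidate for query i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = number of candidates scoring strictly higher than the true one.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)

# Hypothetical 4-way example (the paper ranks among 200 candidates).
sim = [
    [0.9, 0.1, 0.2, 0.0],  # true candidate ranked first
    [0.3, 0.2, 0.8, 0.1],  # two candidates beat the true one
    [0.1, 0.0, 0.7, 0.2],  # true candidate ranked first
    [0.5, 0.6, 0.4, 0.3],  # three candidates beat the true one
]
print(top_k_accuracy(sim, 1))  # 0.5
print(top_k_accuracy(sim, 3))  # 0.75
```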

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the same alignment holds for additional subjects without recalibration, the approach could scale to subject-independent brain-to-image interfaces.
  • Pairing the retrieval and reconstruction stages in a single loop could support real-time visual feedback from brain signals.

Load-bearing premise

EEG segments recorded during natural image viewing contain sufficiently rich and stable visual representations that can be aligned to multi-modal CLIP embeddings without subject-specific recalibration or additional physiological constraints.

What would settle it

Applying the trained models to EEG recorded from a second subject without recalibration and obtaining top-1 retrieval accuracy near the random baseline of 0.5% would demonstrate that the representations are not stable enough for the claimed alignment.
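
The 0.5% baseline follows from uniform guessing among 200 candidates. A short sketch of that arithmetic, using a hypothetical helper `chance_top_k`:

```python
def chance_top_k(num_candidates, k):
    """Probability that a uniformly random ranking places the target in the top k."""
    if not 1 <= k <= num_candidates:
        raise ValueError("k must be between 1 and num_candidates")
    return k / num_candidates

print(f"top-1 chance: {chance_top_k(200, 1):.1%}")  # 0.5%
print(f"top-5 chance: {chance_top_k(200, 5):.1%}")  # 2.5%
```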

Figures

Figures reproduced from arXiv: 2605.23996 by Chi Kit Wong, Haowen Yan, Yan Liu.

Figure 1
Figure 1. Image Retrieval Training Pipeline. Contrastive Training. We minimise the symmetric InfoNCE loss [3, 15] over EEG–image pairs within each mini-batch:

$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{e^{z_i\cdot v_i}}{\sum_{j} e^{z_i\cdot v_j}} + \log\frac{e^{v_i\cdot z_i}}{\sum_{j} e^{v_i\cdot z_j}}\right] \tag{2}$$

In our implementation, the logits use a fixed scale of 1 rather than a learned temperature. At retrieval time, both EEG and visual embeddings are ℓ2-normalized before cosine-similarit…
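
The symmetric InfoNCE loss in the Figure 1 caption can be sketched in plain Python as follows. This is an illustrative reading of the formula, not the authors' training code: the names `log_softmax_diag` and `symmetric_info_nce` are hypothetical, embeddings are plain lists, and the `scale` argument defaults to 1 to mirror the fixed logit scale the caption mentions.

```python
import math

def log_softmax_diag(logits):
    """Mean over rows of the log-softmax evaluated at the matched (diagonal) entry."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in row))
        total += row[i] - log_denominator
    return total / len(logits)

def symmetric_info_nce(eeg, img, scale=1.0):
    """Symmetric InfoNCE over paired EEG (z_i) and image (v_i) embeddings."""
    # logits[i][j] = scale * (z_i . v_j)
    logits = [[scale * sum(a * b for a, b in zip(z, v)) for v in img] for z in eeg]
    transposed = [list(col) for col in zip(*logits)]
    # Average of the EEG-to-image and image-to-EEG terms, with the leading minus sign.
    return -0.5 * (log_softmax_diag(logits) + log_softmax_diag(transposed))

# Toy batch of two perfectly matched, orthogonal pairs (hypothetical values).
eeg = [[1.0, 0.0], [0.0, 1.0]]
img = [[1.0, 0.0], [0.0, 1.0]]
print(symmetric_info_nce(eeg, img))
```

With a fixed scale of 1 and unit-norm embeddings, the logits are bounded in [-1, 1], so the loss cannot approach zero even for perfectly matched pairs; a larger scale or a learned temperature would sharpen the softmax.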
Figure 2
Figure 2. Pipeline of the Training Process of Image Reconstruction.
Figure 3
Figure 3. Task 1 training dynamics. Left: Top-1 test accuracy vs. epoch. Right: InfoNCE training loss vs. epoch. Shaded region: ±1 std across 10 seeds; thin lines: individual seeds.
Figure 4
Figure 4. Qualitative reconstruction results (10 examples). Top row: ground-truth stimulus images. Middle row: images reconstructed by our pipeline from EEG signals (Seed 0). Bottom row: category labels. Categories span animals, objects, plants, and technology (left to right: Aircraft Carrier, Antelope, Banana, Cat, Elephant, Flamingo, Gorilla, Hummingbird, Orchid, Robot).
Figure 5
Figure 5. The RSVP experiment screenshot. The recreation uses the same background colour and the same central spot pattern, but the original image occupies 1/4 of the area. Refer to https://osf.io/3jk45/. Appendix C: Hungarian Retrieval. In the retrieval task (Task 1), our formal results come from standard 200-way retrieval, which only calculates the best-matching image for each of the EEG signals fro…
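
The Appendix C fragment above contrasts standard retrieval, where each EEG signal independently picks its best-matching image, with a Hungarian-style one-to-one assignment. The appendix text is truncated here, so the sketch below only illustrates the general idea and makes several assumptions: the helper names are hypothetical, the true match for query i is taken to be index i, and an exhaustive search over permutations stands in for the Hungarian method, which is only feasible for tiny inputs.

```python
from itertools import permutations

def independent_argmax(similarity):
    """Each query picks its own best candidate; candidates may be reused."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarity]

def one_to_one_assignment(similarity):
    """Best one-to-one matching by exhaustive search (tiny inputs only)."""
    n = len(similarity)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(similarity[i][perm[i]] for i in range(n)),
    )
    return list(best)

# Hypothetical 3-way similarity matrix; query 1 is confused with candidate 0.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.7, 0.1],
    [0.1, 0.2, 0.6],
]
print(independent_argmax(sim))     # [0, 0, 2]
print(one_to_one_assignment(sim))  # [0, 1, 2]
```

For a real 200-way problem, a polynomial-time solver would be needed in place of the permutation search; `scipy.optimize.linear_sum_assignment` with `maximize=True` is one commonly used option. Note that a one-to-one assignment uses the whole test set jointly, which is presumably why the paper reports it separately from its formal results.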
Original abstract

We present a brain-to-image system that decodes visual stimuli from EEG signals recorded during natural image viewing. Our system addresses two tasks: (1) EEG-to-image retrieval, which ranks the correct stimulus image among 200 candidates given an EEG segment, and (2) EEG-to-image reconstruction, which generates an image consistent with the perceived stimulus. For retrieval, we implement a multi-level blurring approach improved with biologically inspired EVNet features and trained with the InfoNCE loss. Evaluated over 10 random seeds for a single subject, the retrieval model achieves a mean final-epoch Top-1 accuracy of 86.30% and Top-5 accuracy of 98.55%. For reconstruction, we implement CognitionCapturerPro, which aligns EEG representations to multi-modal CLIP embeddings, including image, text, depth, and edge embeddings, and synthesizes images with SDXL-Turbo conditioned via IP-Adapter. Averaged over 10 seeds, the reconstruction model achieves a CLIP score of 0.903 using ViT-H-14, a CLIP score of 0.870 using ViT-L/14, and an SSIM of 0.409. These results demonstrate the feasibility of decoding rich visual representations from EEG signals using modern multi-modal alignment and generative modeling techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a brain-to-image decoding system that performs two tasks on EEG signals recorded during natural image viewing: (1) retrieval of the correct image among 200 candidates using multi-level blurring, EVNet features, and InfoNCE loss; and (2) reconstruction via CognitionCapturerPro, which aligns EEG to multi-modal CLIP embeddings (image, text, depth, edge) and conditions SDXL-Turbo through IP-Adapter. All metrics are reported for a single subject averaged over 10 random seeds: retrieval reaches 86.30% Top-1 and 98.55% Top-5 accuracy; reconstruction reaches CLIP scores of 0.903 (ViT-H-14) and 0.870 (ViT-L/14) with SSIM 0.409.

Significance. If the single-subject empirical results prove robust and generalize, the work would demonstrate that modern multimodal contrastive alignment and generative models can extract usable visual information from EEG, offering a concrete technical path for non-invasive brain-to-image interfaces. The explicit use of multiple CLIP modalities (image/text/depth/edge) and the biologically inspired EVNet component constitute clear engineering contributions that could be built upon once subject-independence is established.

major comments (2)
  1. [Abstract] Abstract: All reported performance numbers (retrieval Top-1 86.30%, reconstruction CLIP 0.903) are obtained from a single subject with per-subject training; the absence of any cross-subject, leave-one-subject-out, or multi-subject evaluation directly undermines the central premise that EEG segments contain sufficiently rich and stable visual representations alignable to CLIP embeddings without subject-specific recalibration.
  2. [Abstract] Abstract (and implied Methods/Experiments): No information is supplied on dataset size, number of trials per subject, train-test split protocol, or any statistical testing; without these controls the claim that the multi-level blurring + InfoNCE pipeline and CognitionCapturerPro alignment succeed cannot be evaluated for overfitting or reproducibility.
minor comments (1)
  1. [Abstract] Abstract: The phrase "evaluated over 10 random seeds for a single subject" should be accompanied by standard deviations or confidence intervals for the reported means to allow assessment of stability.
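
The dispersion statistics requested in the minor comment could be reported along the following lines. This is a minimal sketch: the per-seed accuracies are invented for illustration and are NOT the paper's numbers, `summarize_seeds` is a hypothetical helper, and the default critical value 2.262 is the two-sided 95% Student-t value for 9 degrees of freedom, matching 10 seeds.

```python
import math
import statistics

def summarize_seeds(values, t_critical=2.262):
    """Mean, sample std, and t-based 95% CI half-width over per-seed results.

    t_critical defaults to the two-sided 95% value for 9 degrees of freedom
    (10 seeds); change it for other sample sizes.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    half_width = t_critical * std / math.sqrt(len(values))
    return mean, std, half_width

# Hypothetical per-seed Top-1 accuracies (%), not taken from the paper.
top1 = [86.0, 87.0, 85.5, 86.5, 86.0, 87.5, 85.0, 86.5, 86.0, 87.0]
mean, std, hw = summarize_seeds(top1)
print(f"{mean:.2f} ± {std:.2f} (95% CI ± {hw:.2f})")
```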

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the single-subject scope and the need for greater transparency on experimental controls. We respond point-by-point to the major comments.

  1. Referee: [Abstract] Abstract: All reported performance numbers (retrieval Top-1 86.30%, reconstruction CLIP 0.903) are obtained from a single subject with per-subject training; the absence of any cross-subject, leave-one-subject-out, or multi-subject evaluation directly undermines the central premise that EEG segments contain sufficiently rich and stable visual representations alignable to CLIP embeddings without subject-specific recalibration.

    Authors: We agree that the reported results come from a single subject with per-subject training. The central aim of the work is to demonstrate the technical feasibility of multimodal contrastive alignment and generative decoding from EEG using current methods; single-subject evaluation is a standard first step in the EEG visual decoding literature to isolate the signal before tackling inter-subject variability. We will revise the abstract to explicitly note the single-subject design and add a limitations paragraph outlining the path toward subject-independent models. revision: yes

  2. Referee: [Abstract] Abstract (and implied Methods/Experiments): No information is supplied on dataset size, number of trials per subject, train-test split protocol, or any statistical testing; without these controls the claim that the multi-level blurring + InfoNCE pipeline and CognitionCapturerPro alignment succeed cannot be evaluated for overfitting or reproducibility.

    Authors: We will update the abstract to include key dataset statistics (number of trials, single-subject scope) and explicitly state the 10-seed averaging protocol. The methods section already details the train-test splits and evaluation; we will add a short reproducibility subsection summarizing these controls. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ML pipeline reports trained-model metrics without self-referential derivation

The paper presents a standard supervised learning system (multi-level blurring + EVNet + InfoNCE for retrieval; CognitionCapturerPro + multi-modal CLIP alignment + SDXL-Turbo for reconstruction) and directly reports performance numbers obtained by training and evaluating on the EEG-image pairs. No mathematical derivation chain, first-principles claim, or uniqueness theorem is asserted that reduces to the inputs by construction. No self-citations, ansatzes, or fitted-parameter renamings appear in the provided text. The reported accuracies are the expected output of the training procedure itself, not a circular reduction of an independent prediction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that EEG encodes decodable visual content and on the standard machine-learning premise that contrastive and generative objectives can recover that content; no new physical entities are introduced.

free parameters (2)
  • InfoNCE temperature and other contrastive hyperparameters
    Standard tunable scalars in the loss used for retrieval training.
  • Alignment weights across CLIP modalities and IP-Adapter conditioning strength
    Multiple scaling factors chosen during training of the reconstruction model.
axioms (1)
  • domain assumption: EEG signals recorded during natural image viewing contain stable, decodable visual information sufficient for alignment to CLIP embeddings
    Invoked as the enabling premise for both retrieval and reconstruction tasks in the abstract.

Reference graph

Works this paper leans on

21 extracted references · 17 canonical work pages · 7 internal anchors

  1. [1]

    Wenchao Liu, Hongwei Li, Zhouyang Xu, Lin Ma, and Haifeng Li. Leveraging visual blur perception characteristics for eeg decoding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(21):17580–17588, Mar. 2026. doi: 10.1609/aaai.v40i21.38813. URL https://ojs.aaai.org/index.php/AAAI/article/view/38813

  2. [2]

    Lucas Piper, Arlindo L. Oliveira, and Tiago Marques. Explicitly modeling subcortical vision with a neuro-inspired front-end improves cnn robustness, 2025. URL https://arxiv.org/abs/2506.03089

  3. [3]

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2019. URL https://arxiv.org/abs/1807.03748

  4. [4]

    Kaifan Zhang, Lihuo He, Junjie Ke, Yuqi Ji, Lukun Wu, Lizi Wang, and Xinbo Gao. Cognitioncapturerpro: Towards high-fidelity visual decoding from eeg/meg via multi-modal information and asymmetric alignment, 2026. URL https://arxiv.org/abs/2603.12722

  5. [5]

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation, 2023. URL https://arxiv.org/abs/2311.17042

  6. [6]

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models, 2023. URL https://arxiv.org/abs/2308.06721

  7. [7]

    Alessandro T. Gifford, Kshitij Dwivedi, Gemma Roig, and Radoslaw M. Cichy. A large and rich eeg dataset for modeling human visual object recognition. NeuroImage, 264:119754, 2022. ISSN 1053-8119. doi: https://doi.org/10.1016/j.neuroimage.2022.119754. URL https://www.sciencedirect.com/science/article/pii/S1053811922008758

  8. [8]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

  9. [9]

    Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612,

  10. [10]

    doi: 10.1109/TIP.2003.819861

  11. [11]

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b...

  12. [12]

    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments, 2021. URL https://arxiv.org/abs/2006.09882

  13. [13]

    Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks, 2020. URL https://arxiv.org/abs/1905.11946

  14. [14]

    Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 2818–2829. IEEE, 2023. doi: 10.1109/cvpr52729.2023.00276...

  15. [15]

    Joel Dapello, Tiago Marques, Martin Schrimpf, Franziska Geiger, David D. Cox, and James J. DiCarlo. Simulating a primary visual cortex at the front of CNNs improves robustness to image perturbations. In Advances in Neural Information Processing Systems, 2020

  16. [16]

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations, 2020. URL https://arxiv.org/abs/2002.05709

  17. [17]

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. URL https://arxiv.org/abs/2307.01952

  18. [18]

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101

  19. [19]

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text model...

  20. [20]

    Dongyang Li, Chen Wei, Shiying Li, Jiachen Zou, Haoyang Qin, and Quanying Liu. Visual decoding and reconstruction via eeg embeddings with guided diffusion, 2024. URL https://arxiv.org/abs/2403.07721

  21. [21]

    H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955. doi: https://doi.org/10.1002/nav.3800020109. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109.

    Appendix A: Team Contribution Statement
      • Chi Kit WONG: Participated in the coding work for Task 1 and Task 2, presented the Task...