MindAlign: Bridging EEG, Vision, and Language for Zero-Shot Visual Decoding

Alexandra Woolgar; Huichao Qi; Lihui Wang; Runhao Lu; Sichao Liu; Xi Vincent Wang; Zexuan Chen

arXiv: 2605.24523 · v1 · pith:ASLHY7QOnew · submitted 2026-05-23 · cs.LG · cs.CL · q-bio.NC

Zexuan Chen, Sichao Liu, Runhao Lu, Huichao Qi, Alexandra Woolgar, Xi Vincent Wang, Lihui Wang

Pith reviewed 2026-06-30 14:49 UTC · model grok-4.3

keywords EEG visual decoding · zero-shot classification · multimodal contrastive learning · brain signal alignment · Things-EEG2 benchmark · masked reconstruction pre-training · semantic regularization

The pith

A two-stage tri-modal contrastive method aligns EEG, image, and text embeddings to decode visual categories from brain signals without training examples for the target classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that first pre-trains an EEG encoder on unlabeled trials using masked reconstruction to capture spatio-temporal patterns, then fine-tunes it by contrasting EEG representations against both visual features and LLM-generated text descriptions in a shared space. Text acts as a semantic regularizer that structures the alignment without replacing the primary EEG-image signal. On the Things-EEG2 dataset this yields 54.1 percent top-1 and 83.4 percent top-5 accuracy in 200-way zero-shot classification, well above the strongest prior baseline (32.4 percent and 64.0 percent), with paired Wilcoxon tests confirming significance (p < 0.01) over all in-subject baselines. The authors also report that the pipeline generalizes to Things-MEG. They further observe that compact embedding spaces (CN-CLIP) outperform much larger vision backbones and that decoding aligns with the established neurophysiology of visual processing.
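
The first stage can be illustrated with a minimal sketch. It assumes non-overlapping temporal patches, Gaussian noise as the mask, and a mean-squared error scored only on the masked patches (an MAE-style choice). The summary above does not specify these details, and the function names, patch length, and mask ratio are all hypothetical.

```python
import random

def split_patches(signal, patch_len):
    """Split a 1-D signal into non-overlapping patches (any tail is dropped)."""
    n = len(signal) // patch_len
    return [signal[i * patch_len:(i + 1) * patch_len] for i in range(n)]

def mask_patches(patches, mask_ratio, rng, noise_std=1.0):
    """Replace a random subset of patches with Gaussian noise (assumed masking scheme)."""
    n_mask = int(round(mask_ratio * len(patches)))
    masked_idx = sorted(rng.sample(range(len(patches)), n_mask))
    corrupted = [list(p) for p in patches]
    for i in masked_idx:
        corrupted[i] = [rng.gauss(0.0, noise_std) for _ in patches[i]]
    return corrupted, masked_idx

def masked_mse(reconstruction, target, masked_idx):
    """Mean squared error over masked patches only (assumed objective)."""
    errs = [(r - t) ** 2
            for i in masked_idx
            for r, t in zip(reconstruction[i], target[i])]
    return sum(errs) / len(errs) if errs else 0.0

rng = random.Random(0)
signal = [float(t % 7) for t in range(40)]        # toy single-channel trace
patches = split_patches(signal, patch_len=5)      # 8 patches of 5 samples
corrupted, masked_idx = mask_patches(patches, mask_ratio=0.5, rng=rng)

# A model that merely copies its corrupted input is penalized; a perfect one is not.
loss_copy = masked_mse(corrupted, patches, masked_idx)
loss_perfect = masked_mse(patches, patches, masked_idx)
```

Scoring only the masked patches forces the encoder to infer hidden segments from their context rather than pass visible samples through unchanged.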

Core claim

By pre-training an EEG encoder via masked reconstruction and then performing joint contrastive alignment of EEG, image, and LLM text embeddings, the method produces a latent space in which EEG signals can be matched to novel visual categories at 54.1 percent top-1 accuracy on a 200-way benchmark, about 22 points above the strongest prior baseline (32.4 percent), while preserving neurophysiologically plausible structure.
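
For readers unfamiliar with how such zero-shot figures are scored, here is a minimal sketch of top-k retrieval accuracy. It assumes one candidate embedding per class and cosine similarity as the ranking score; the 3-way toy data and all names are invented for illustration. For reference, uniform random guessing on a 200-way task gives 1/200 = 0.5 percent top-1 and 5/200 = 2.5 percent top-5.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_accuracy(eeg_embs, class_embs, labels, k):
    """Fraction of trials whose true class is among the k most similar candidates."""
    hits = 0
    for emb, label in zip(eeg_embs, labels):
        sims = [cosine(emb, c) for c in class_embs]
        ranked = sorted(range(len(class_embs)), key=lambda j: sims[j], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

# Toy 3-way example (hypothetical values, not data from the paper).
class_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
eeg_embs = [[0.9, 0.1, 0.0], [0.6, 0.5, 0.0], [0.0, 0.2, 0.8]]
labels = [0, 1, 2]

top1 = top_k_accuracy(eeg_embs, class_embs, labels, k=1)
top2 = top_k_accuracy(eeg_embs, class_embs, labels, k=2)
```

In the toy data the second trial ranks the wrong class first but the right class second, so it counts as a miss at k=1 and a hit at k=2.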

What carries the argument

A two-stage tri-modal contrastive alignment in which an EEG encoder (graph attention plus convolutional embeddings plus subject adaptation) is first pre-trained by masked reconstruction and then jointly contrasted against image and LLM-text representations so that text supplies semantic structure while EEG-image pairing remains the primary objective.
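
The alignment stage described above can be sketched as a weighted sum of two contrastive terms. This is an illustration under stated assumptions, not the authors' loss: it assumes a symmetric InfoNCE objective, and the temperature of 0.07 and the text weight of 0.1 are hypothetical values chosen only to show text acting as a down-weighted regularizer.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(queries, keys, temperature):
    """One-directional InfoNCE: query i should match key i within the batch."""
    q = [normalize(x) for x in queries]
    k = [normalize(x) for x in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)

def symmetric_info_nce(a, b, temperature):
    """Average of the a-to-b and b-to-a directions."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

def tri_modal_loss(eeg, image, text, temperature=0.07, text_weight=0.1):
    """EEG-image term is primary; EEG-text term is a down-weighted regularizer (assumed form)."""
    primary = symmetric_info_nce(eeg, image, temperature)
    regularizer = symmetric_info_nce(eeg, text, temperature)
    return primary + text_weight * regularizer

# Toy batch of two triplets (hypothetical embeddings).
eeg = [[1.0, 0.0], [0.0, 1.0]]
image = [[0.9, 0.1], [0.1, 0.9]]
text = [[1.0, 0.2], [0.2, 1.0]]

aligned = tri_modal_loss(eeg, image, text)
misaligned = tri_modal_loss(eeg, image[::-1], text)  # images swapped between trials
```

Setting `text_weight=0.0` reduces the sketch to a bi-modal EEG-image objective, which is the natural baseline for judging what the text term adds.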

If this is right

EEG signals become usable for zero-shot retrieval among hundreds of object categories without per-category training data.
The same pipeline generalizes to MEG recordings (validated on Things-MEG).
Compact embedding geometries (rather than the largest available vision models) yield the strongest decoding performance.
Decoded representations respect the temporal hierarchy of visual cortex responses.
Subject-specific adaptation layers allow the model to maintain performance across individuals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the text regularizer proves robust across languages or description styles, the same pipeline could be applied to other non-invasive signals such as fNIRS without new labeled datasets.
The observed superiority of compact embeddings suggests that deployment on wearable EEG hardware may be feasible without large on-device models.
Because the method already shows alignment with known visual processing stages, targeted experiments could test whether the learned embeddings predict specific perceptual phenomena such as category typicality or viewpoint invariance.
Extending the pre-training stage to include more unlabeled EEG corpora from clinical or consumer devices could further reduce the amount of task-specific data required.

Load-bearing premise

LLM-generated text descriptions supply a semantic regularizer that improves EEG-to-image alignment without adding noise or bias that would degrade the primary visual decoding signal.

What would settle it

An independent replication on the identical Things-EEG2 200-way split that finds no statistically significant lift above the prior 32.4 percent top-1 baseline after the same two-stage training procedure.
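
A replication would need a paired test over subjects to judge such a lift. The sketch below computes an exact two-sided Wilcoxon signed-rank p-value by enumerating sign patterns, which is practical only for small samples. The per-subject accuracies are invented for illustration and are not the paper's data; the function names are hypothetical.

```python
from itertools import product

def signed_ranks(diffs):
    """Rank nonzero |differences| (average ranks for ties); return (W_plus, ranks)."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1          # 1-based average rank of the tie group
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    return w_plus, ranks

def wilcoxon_exact_two_sided(x, y):
    """Exact two-sided p-value: share of sign patterns at least as extreme as observed."""
    w_plus, ranks = signed_ranks([a - b for a, b in zip(x, y)])
    total = sum(ranks)
    observed = min(w_plus, total - w_plus)
    extreme = 0
    patterns = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        extreme += min(w, total - w) <= observed + 1e-12
        patterns += 1
    return w_plus, extreme / patterns

# Hypothetical per-subject top-1 accuracies (percent) for 10 subjects.
new_method = [41, 45, 38, 52, 47, 50, 43, 56, 49, 60]
baseline = [30, 33, 29, 35, 31, 36, 34, 37, 32, 38]
w_plus, p_value = wilcoxon_exact_two_sided(new_method, baseline)
```

With 10 subjects, the smallest achievable two-sided p-value is 2/1024, reached only when every subject improves, so a threshold of p < 0.01 leaves little room for subjects who do not.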

Figures

Figures reproduced from arXiv: 2605.24523 by Alexandra Woolgar, Huichao Qi, Lihui Wang, Runhao Lu, Sichao Liu, Xi Vincent Wang, Zexuan Chen.

**Figure 1.** Overview of the framework and decoding performance on Things-EEG2. (a) Tri-modal contrastive alignment. EEG signals, visual stimuli, and LLM-generated descriptions are encoded into a shared feature space, where corresponding triplets are aligned through contrastive learning and mismatched samples are separated. At inference, an EEG embedding retrieves the most similar image and text candidates in this spac…

**Figure 2.** Framework overview. Stage 1 (pre-training): EEG signals are split and partially masked with noise, and reconstructed by a lightweight decoder from the encoder's latents, driving the encoder to learn intrinsic neural dynamics. Stage 2 (tri-modal alignment): the pre-trained EEG encoder is jointly trained with frozen image and text encoders, where text descriptions are generated by an LLM from visual content…

**Figure 3.** EEG decoding performance on the Things-EEG2 dataset. Left: in-subject comparison across five methods. Right: cross-subject (leave-one-subject-out) comparison across four methods (see Tables 9 and 10 in Appendix C, respectively). From the accompanying text: the framework is evaluated for EEG-to-image recognition under two protocols; in the in-subject setting, the model is trained and tested on data from the same participant. …

**Figure 4.** Semantic structure of learned EEG representations. Left: cosine-similarity matrix of EEG embeddings over 200 test concepts (averaged across 10 subjects); the block-diagonal pattern reveals intra-category clustering. Right: top-5 image retrievals per category (correct matches in red); near-miss errors (e.g., cruise ship → ferry) reflect category-level proximity.

**Figure 5.** Temporal, spatial, and spectral analyses on EEG. (a) Electrode layout, color-coded by anatomical region. (b) Spatial decoding by region: occipital sensors dominate, followed by temporal and parietal. (c) Temporal decoding under cumulative [0, t], sliding [t−100, t], and post-onset [t, 1000] ms windows. (d) Spectral decoding across δ, θ, α, β, γ, and full-band.

**Figure 6.** An example of masked EEG input and reconstructed result for a randomly selected channel from one trial. The reconstructed waveform captures the main low-frequency trends of the original signal, while fine-grained details remain limited by the inherent noise of EEG recordings.

**Figure 7.** Descriptions generated for an image using different LLMs. Red indicates the object label, and blue indicates object details. Qwen2-VL-7B generates more detailed, context-rich descriptions, capturing attributes such as quantity and surrounding elements, whereas LLaVA-1.5-7B tends to produce more concise descriptions focused on the primary object. Example descriptions: LLaVA-1.5-7B: "Green apples hang from a tree." Qwen2-VL-7B: "Two green apples hang from an apple tree, surrounded by leaves and bra…"

**Figure 8.** Qualitative Top-1 retrieval results obtained with different visual and EEG encoders, with the ground-truth image shown in the first row. The results are generated following the same experimental configuration as those evaluated in Tables 11 and 13 (Appendix).

**Figure 9.** Representational similarity matrices across 10 subjects.

**Figure 10.** Topographies of EEG signals averaged across all trials for Subject 1 at 100 ms intervals. A clear response is observed in the occipital area (0-100 ms), followed by activity in the temporal area (100-600 ms) after stimulus onset. The 200-ms SOA still induces periodic responses in the occipital cortex. Frontal activity gradually increases, possibly reflecting additional cognitive processes.
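
The three temporal window types named in the Figure 5 caption (cumulative [0, t], sliding [t−100, t], and post-onset [t, 1000] ms) can be written out as a small helper. This is a sketch only: the function names are hypothetical, and the 250 Hz sampling rate in the usage lines is an assumed value for illustration, not one taken from the paper.

```python
def decoding_windows(t_ms, slide_ms=100, end_ms=1000):
    """Return the three analysis windows (start_ms, stop_ms) for a time point t."""
    return {
        "cumulative": (0, t_ms),                        # [0, t]
        "sliding": (max(0, t_ms - slide_ms), t_ms),     # [t - 100, t]
        "post_onset": (t_ms, end_ms),                   # [t, 1000]
    }

def window_to_samples(window_ms, sampling_rate_hz):
    """Convert a millisecond window to integer sample indices."""
    start_ms, stop_ms = window_ms
    return (start_ms * sampling_rate_hz // 1000,
            stop_ms * sampling_rate_hz // 1000)

windows = decoding_windows(300)
# Assumed 250 Hz sampling rate, for illustration only.
sliding_samples = window_to_samples(windows["sliding"], sampling_rate_hz=250)
```

Comparing the cumulative and post-onset windows at the same t shows how much decodable information arrives before versus after that time point.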

Original abstract

Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. We introduce a tri-modal contrastive framework for EEG-based visual decoding that aligns EEG, visual, and textual representations within a unified latent space. Our approach follows a two-stage design. First, we pre-train an EEG encoder via masked reconstruction on unlabeled trials, learning spatio-temporal regularities that transfer robustly to downstream tasks. Second, we jointly align EEG, image, and LLM-generated textual descriptions through contrastive learning, where text supervision acts as a semantic regularizer that injects linguistic structure into the shared space without overwhelming the primary EEG-image signal. The encoder integrates subject-specific adaptation, graph-attention over channels, and temporal-spatial convolutional embeddings. On the Things-EEG2 200-way zero-shot benchmark, our framework achieves 54.1% Top-1 and 83.4% Top-5 accuracy, substantially exceeding the strongest prior baseline (32.4% / 64.0%), with paired Wilcoxon tests confirming significance (p < 0.01) over all in-subject baselines. We validate generalization on Things-MEG. Analysis reveals that compact embedding geometries (CN-CLIP) outperform much larger backbones, and that decoding aligns with established neurophysiology of visual processing. This work is a critical step towards robust, semantically-grounded visual decoding from non-invasive temporal neural signals. The source code is publicly available in https://github.com/anon-eeg/eeg_image_decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The 54% top-1 claim on Things-EEG2 is the headline number, but the tri-modal text regularizer needs explicit controls before the gain can be trusted as EEG-driven rather than caption-driven.

The main takeaway is a reported lift from 32.4% to 54.1% top-1 (and 64% to 83.4% top-5) on the 200-way zero-shot Things-EEG2 benchmark, with Wilcoxon significance. The method uses masked pre-training on EEG followed by contrastive alignment of EEG, image, and LLM-generated text embeddings, plus graph attention and subject adaptation in the encoder.

What the paper does cleanly is lay out a two-stage pipeline that treats text as a semantic regularizer rather than the main driver, releases code, checks a second dataset (Things-MEG), and notes that compact CN-CLIP embeddings outperform larger backbones. Those pieces are concrete and reproducible on the surface.

The soft spot is the one flagged in the stress test. The abstract asserts that text injects structure without overwhelming the EEG-image signal, yet the 22-point gain could arise if the generated captions carry class-specific information that EEG does not. No ablation of the text component, no mismatched-caption control, and no cross-LLM consistency check are described in the available text. Without those, the statistical test does not isolate the neurophysiological contribution. The low rating in the reader's report is warranted because only the abstract is visible here; full methods, splits, and caption generation details are needed to judge.

This is for labs working on non-invasive visual decoding and multimodal BCI. A reader who wants the latest benchmark numbers and a working pipeline description will find it useful, even if they plan to rerun the controls themselves.

It should go to peer review. The performance delta is large enough and the framing is specific enough that referees can usefully pressure the text-regularizer claim rather than reject outright.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MindAlign, a tri-modal contrastive framework for zero-shot visual decoding from EEG. It employs a two-stage pipeline: (1) masked reconstruction pre-training of an EEG encoder to learn spatio-temporal features, and (2) joint contrastive alignment of EEG, image, and LLM-generated textual embeddings, with text acting as a semantic regularizer. On the Things-EEG2 200-way zero-shot benchmark the method reports 54.1% Top-1 / 83.4% Top-5 accuracy, exceeding the strongest prior baseline (32.4% / 64.0%) with paired Wilcoxon significance (p < 0.01); generalization to Things-MEG is claimed and code is released.

Significance. If the reported gains are shown to derive from EEG-visual alignment rather than linguistic artifacts, the work would mark a substantial empirical advance in non-invasive visual decoding by demonstrating that compact multimodal embeddings can outperform prior approaches by a wide margin while aligning with known neurophysiology. Public code release strengthens reproducibility.

major comments (2)

[Abstract and two-stage design] The claim that LLM-generated textual descriptions 'inject linguistic structure into the shared space without overwhelming the primary EEG-image signal' is load-bearing for the 22-point accuracy improvement. No ablation that removes the text modality, varies caption sources, or blinds caption generation is reported, leaving open the possibility that class-specific correlations between generated captions and image categories (rather than EEG signal) drive the result.
[Results (Things-EEG2)] The paired Wilcoxon tests establish statistical significance over in-subject baselines, yet without quantitative controls (e.g., caption-only or image-only contrastive runs, or cross-LLM consistency checks) it is impossible to isolate whether the 54.1% Top-1 figure reflects neurophysiological information or injected linguistic bias.
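
One of the controls raised in this review, a mismatched-caption run, could be set up roughly as follows. The sketch reassigns captions so that no trial keeps a caption from its own class; if accuracy holds up under this shuffle, the text branch is not supplying class-specific information. The labels, captions, and function name are invented for illustration, and rejection sampling is just one simple way to draw such a permutation.

```python
import random

def class_mismatched_permutation(labels, rng, max_tries=1000):
    """Draw a permutation in which position i never receives an item of its own class."""
    idx = list(range(len(labels)))
    for _ in range(max_tries):
        rng.shuffle(idx)
        if all(labels[i] != labels[j] for i, j in enumerate(idx)):
            return list(idx)
    raise ValueError("no class-mismatched assignment found")

# Hypothetical trials: two captions per class.
labels = ["apple", "apple", "ferry", "ferry", "dog", "dog"]
captions = [
    "apples on a tree", "a green apple",
    "a ferry at sea", "a car ferry docking",
    "a dog running", "a sleeping dog",
]

perm = class_mismatched_permutation(labels, random.Random(0))
control_captions = [captions[j] for j in perm]  # trial i now has a wrong-class caption
```

Training once with `captions` and once with `control_captions`, everything else fixed, gives a direct estimate of how much of the gain depends on caption content.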

minor comments (2)

[Abstract] The abstract states validation on Things-MEG but supplies no numerical results; including these metrics would clarify the generalization claim.
[Methods (encoder architecture)] Notation for the graph-attention and temporal-spatial convolutional components of the encoder would benefit from an explicit diagram or equation reference to aid replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We appreciate the emphasis on isolating the contributions of each modality. Below we respond to the major comments and outline planned revisions to strengthen the manuscript.

Referee: [Abstract and two-stage design] The claim that LLM-generated textual descriptions 'inject linguistic structure into the shared space without overwhelming the primary EEG-image signal' is load-bearing for the 22-point accuracy improvement. No ablation that removes the text modality, varies caption sources, or blinds caption generation is reported, leaving open the possibility that class-specific correlations between generated captions and image categories (rather than EEG signal) drive the result.

Authors: We agree that the absence of ablations isolating the text modality leaves the contribution of linguistic supervision open to question. The two-stage design positions text as a semantic regularizer, but to rigorously demonstrate that gains derive from EEG-visual alignment, we will add the suggested ablations: (1) a bi-modal EEG-image contrastive baseline, (2) caption-only alignment runs, and (3) consistency checks across different LLMs for caption generation. These will be included in the revised manuscript to quantify the incremental benefit of the tri-modal setup. Revision: yes.

Referee: [Results (Things-EEG2)] The paired Wilcoxon tests establish statistical significance over in-subject baselines, yet without quantitative controls (e.g., caption-only or image-only contrastive runs, or cross-LLM consistency checks) it is impossible to isolate whether the 54.1% Top-1 figure reflects neurophysiological information or injected linguistic bias.

Authors: The statistical tests confirm that our method outperforms prior EEG-only baselines. However, we recognize that without the additional controls mentioned, it is difficult to fully attribute the performance to neurophysiological signals versus potential linguistic artifacts. We will incorporate the quantitative controls (caption-only, image-only, and cross-LLM checks) into the results section of the revised version to address this concern directly. Revision: yes.

Circularity Check

0 steps flagged

No circularity: empirical benchmark results independent of inputs

The paper reports measured Top-1/Top-5 accuracies on the Things-EEG2 200-way zero-shot benchmark after a two-stage training procedure (masked reconstruction pre-training followed by tri-modal contrastive alignment). These are direct empirical outcomes on held-out test data rather than quantities derived from equations, fitted parameters renamed as predictions, or self-citations. No load-bearing mathematical derivations, uniqueness theorems, or ansatzes are present in the provided text that reduce to the inputs by construction. The LLM text descriptions function as an auxiliary regularizer in the contrastive loss but do not create a self-definitional loop with the reported accuracies.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions of contrastive learning (that aligned embeddings capture semantic similarity) and the validity of using LLM-generated text as a proxy for semantic structure in EEG data. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

Domain assumption: Contrastive loss aligns representations such that matching EEG-image-text triples are closer than non-matching ones in the shared space. Invoked in the description of the joint alignment stage.
Domain assumption: LLM-generated textual descriptions provide a reliable semantic regularizer without overwhelming the EEG-image signal. Stated explicitly in the two-stage design.

Reference graph

Works this paper leans on

66 extracted references · 18 canonical work pages · 6 internal anchors

[1]

Decoding the brain: From neural representations to mechanistic models

Mackenzie Weygandt Mathis, Adriana Perez Rotondo, Edward F Chang, Andreas S Tolias, and Alexander Mathis. Decoding the brain: From neural representations to mechanistic models. Cell, 187(21):5814–5832, 2024

2024
[2]

Using goal-driven deep learning models to understand sensory cortex.Nature neuroscience, 19(3):356–365, 2016

Daniel LK Yamins and James J DiCarlo. Using goal-driven deep learning models to understand sensory cortex.Nature neuroscience, 19(3):356–365, 2016

2016
[3]

Decoding the visual and subjective contents of the human brain.Nature neuroscience, 8(5):679–685, 2005

Yukiyasu Kamitani and Frank Tong. Decoding the visual and subjective contents of the human brain.Nature neuroscience, 8(5):679–685, 2005

2005
[4]

Identifying natural images from human brain activity.Nature, 452(7185):352–355, 2008

Kendrick N Kay, Thomas Naselaris, Ryan J Prenger, and Jack L Gallant. Identifying natural images from human brain activity.Nature, 452(7185):352–355, 2008

2008
[5]

Visual image reconstruction from human brain activity using a combination of multiscale local image decoders.Neuron, 60(5):915–929, 2008

Yoichi Miyawaki, Hajime Uchida, Okito Yamashita, Masa-aki Sato, Yusuke Morito, Hiroki C Tanabe, Norihiro Sadato, and Yukiyasu Kamitani. Visual image reconstruction from human brain activity using a combination of multiscale local image decoders.Neuron, 60(5):915–929, 2008

2008
[6]

Performance-optimized hierarchical models predict neural responses in higher visual cortex.Proceedings of the national academy of sciences, 111(23):8619–8624, 2014

Daniel LK Yamins, Ha Hong, Charles F Cadieu, Ethan A Solomon, Darren Seibert, and James J DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex.Proceedings of the national academy of sciences, 111(23):8619–8624, 2014

2014
[7]

Deep neural networks: a new framework for modeling biological vision and brain information processing.Annual review of vision science, 1:417–446, 2015

Nikolaus Kriegeskorte. Deep neural networks: a new framework for modeling biological vision and brain information processing.Annual review of vision science, 1:417–446, 2015

2015
[8]

Brains and algorithms partially converge in natural language processing.Communications biology, 5(1):134, 2022

Charlotte Caucheteux and Jean-Rémi King. Brains and algorithms partially converge in natural language processing.Communications biology, 5(1):134, 2022

2022
[9]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[10]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

2024
[11]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

A large and rich eeg dataset for modeling human visual object recognition.NeuroImage, 264:119754, 2022

Alessandro T Gifford, Kshitij Dwivedi, Gemma Roig, and Radoslaw M Cichy. A large and rich eeg dataset for modeling human visual object recognition.NeuroImage, 264:119754, 2022

2022
[13]

Human eeg recordings for 1,854 concepts presented in rapid serial visual presentation streams

Tijl Grootswagers, Ivy Zhou, Amanda K Robinson, Martin N Hebart, and Thomas A Carlson. Human eeg recordings for 1,854 concepts presented in rapid serial visual presentation streams. Scientific Data, 9(1):3, 2022

2022
[14]

Decoding natural images from eeg for object recognition.arXiv preprint arXiv:2308.13234, 2023

Y . Song et al. Decoding natural images from eeg for object recognition.arXiv preprint arXiv:2308.13234, 2023

work page arXiv 2023
[15]

Recognizing natural images from eeg with language-guided contrastive learning.IEEE Transactions on Neural Networks and Learning Systems, 2025

Yonghao Song, Yijun Wang, Huiguang He, and Xiaorong Gao. Recognizing natural images from eeg with language-guided contrastive learning.IEEE Transactions on Neural Networks and Learning Systems, 2025

2025
[16]

Visual decoding and reconstruction via eeg embeddings with guided diffusion, 2024

Dongyang Li, Chen Wei, Shiying Li, Jiachen Zou, Haoyang Qin, and Quanying Liu. Vi- sual decoding and reconstruction via eeg embeddings with guided diffusion.arXiv preprint arXiv:2403.07721, 2024

work page arXiv 2024
[17]

Bridging the vision-brain gap with an uncertainty-aware blur prior

Haitao Wu, Qing Li, Changqing Zhang, Zhen He, and Xiaomin Ying. Bridging the vision-brain gap with an uncertainty-aware blur prior. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2246–2257, 2025. 11

2025
[18]

Neural-mcrl: Neural multimodal contrastive representation learning for eeg-based visual decoding

Yueyang Li, Zijian Kang, Shengyu Gong, Wenhao Dong, Weiming Zeng, Hongjie Yan, Wai Ting Siok, and Nizhuan Wang. Neural-mcrl: Neural multimodal contrastive representation learning for eeg-based visual decoding. In2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025

2025
[19]

Mapping human brain function with meg and eeg: methods and validation.NeuroImage, 23:S289–S299, 2004

Felix Darvas, D Pantazis, E Kucukaltun-Yildirim, and RM Leahy. Mapping human brain function with meg and eeg: methods and validation.NeuroImage, 23:S289–S299, 2004

2004
[20]

Classification of eeg signals based on pattern recognition approach

Hafeez Ullah Amin, Wajid Mumtaz, Ahmad Rauf Subhani, Mohamad Naufal Mohamad Saad, and Aamir Saeed Malik. Classification of eeg signals based on pattern recognition approach. Frontiers in computational neuroscience, 11:103, 2017

2017
[21]

A review of issues related to data acquisition and analysis in eeg/meg studies.Brain sciences, 7(6):58, 2017

Aina Puce and Matti S Hämäläinen. A review of issues related to data acquisition and analysis in eeg/meg studies.Brain sciences, 7(6):58, 2017

2017
[22]

A common, high-dimensional model of the representational space in human ventral temporal cortex.Neuron, 72(2):404–416, 2011

James V Haxby, J Swaroop Guntupalli, Andrew C Connolly, Yaroslav O Halchenko, Bryan R Conroy, M Ida Gobbini, Michael Hanke, and Peter J Ramadge. A common, high-dimensional model of the representational space in human ventral temporal cortex.Neuron, 72(2):404–416, 2011

2011
[23]

Changde Du, Kaicheng Fu, Jinpeng Li, and Huiguang He. Decoding visual neural representa- tions by multimodal learning of brain-visual-linguistic features.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10760–10777, 2023

2023
[24]

Reve: A foundation model for eeg–adapting to any setup with large-scale pretraining on 25,000 subjects.arXiv preprint arXiv:2510.21585, 2025

Yassine El Ouahidi, Jonathan Lys, Philipp Thölke, Nicolas Farrugia, Bastien Pasdeloup, Vincent Gripon, Karim Jerbi, and Giulia Lioi. Reve: A foundation model for eeg–adapting to any setup with large-scale pretraining on 25,000 subjects.arXiv preprint arXiv:2510.21585, 2025

work page arXiv 2025
[25]

Thd-bar: Topology hierarchical derived brain autoregressive modeling for eeg generic representations.arXiv preprint arXiv:2511.13733, 2025

Wenchao Yang, Weidong Yan, Wenkang Liu, Yulan Ma, and Yang Li. Thd-bar: Topology hierarchical derived brain autoregressive modeling for eeg generic representations.arXiv preprint arXiv:2511.13733, 2025

work page arXiv 2025
[26]

Spiced: A synaptic homeostasis-inspired framework for unsupervised continual eeg decoding

Yangxuan Zhou, Sha Zhao, Jiquan Wang, Haiteng Jiang, Shijian Li, Tao Li, and Gang Pan. Spiced: A synaptic homeostasis-inspired framework for unsupervised continual eeg decoding. arXiv preprint arXiv:2509.17439, 2025

work page arXiv 2025
[27]

Encoding and decoding in fmri.Neuroimage, 56(2):400–410, 2011

Thomas Naselaris, Kendrick N Kay, Shinji Nishimoto, and Jack L Gallant. Encoding and decoding in fmri.Neuroimage, 56(2):400–410, 2011

2011
[28]

Natural speech reveals the semantic maps that tile human cerebral cortex.Nature, 532 (7600):453–458, 2016

Alexander G Huth, Wendy A De Heer, Thomas L Griffiths, Frédéric E Theunissen, and Jack L Gallant. Natural speech reveals the semantic maps that tile human cerebral cortex.Nature, 532 (7600):453–458, 2016

2016
[29]

Brain-score: Which artificial neural network for object recognition is most brain-like?BioRxiv, page 407007, 2018

Martin Schrimpf, Jonas Kubilius, Ha Hong, Najib J Majaj, Rishi Rajalingham, Elias B Issa, Kohitij Kar, Pouya Bashivan, Jonathan Prescott-Roy, Franziska Geiger, et al. Brain-score: Which artificial neural network for object recognition is most brain-like?BioRxiv, page 407007, 2018

2018
[30]

Deep learning human mind for automated visual classification

Concetto Spampinato, Simone Palazzo, Isaak Kavasidis, Daniela Giordano, Nasim Souly, and Mubarak Shah. Deep learning human mind for automated visual classification. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6809–6817, 2017

2017
[31]

Brain decoding: toward real-time reconstruction of visual perception.arXiv preprint arXiv:2310.19812, 2023

Yohann Benchetrit, Hubert Banville, and Jean-Rémi King. Brain decoding: toward real-time reconstruction of visual perception.arXiv preprint arXiv:2310.19812, 2023

work page arXiv 2023
[32]

Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data.arXiv preprint arXiv:2403.11207, 2024

Paul S Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A Norman, et al. Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data.arXiv preprint arXiv:2403.11207, 2024

work page arXiv 2024
[33]

Mindbridge: A cross-subject brain decoding framework

Shizun Wang, Songhua Liu, Zhenxiong Tan, and Xinchao Wang. Mindbridge: A cross-subject brain decoding framework. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11333–11342, 2024. 12

2024
[34]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

2023
[35]

Enigma: A unified lightweight eeg-to-image model for multi-subject visual decoding

Reese Kneeland, Wangshu Jiang, Ugo Bruzadin Nunes, Si Kai Lee, Paul Steven Scotti, Arnaud Delorme, and Jonathan Xu. Enigma: A unified lightweight eeg-to-image model for multi-subject visual decoding. InNeurIPS 2025 Workshop on Foundation Models for the Brain and Body, 2025

2025
[36]

Deep learning with convolutional neural networks for eeg decoding and visualization

Robin Tibor Schirrmeister, Jost Tobias Springenberg, Lukas Dominique Josef Fiederer, Martin Glasstetter, Katharina Eggensperger, Michael Tangermann, Frank Hutter, Wolfram Burgard, and Tonio Ball. Deep learning with convolutional neural networks for eeg decoding and visualization. Human brain mapping, 38(11):5391–5420, 2017

2017
[37]

Lstm-based eeg classification in motor imagery tasks.IEEE transactions on neural systems and rehabilitation engineering, 26(11):2086–2095, 2018

Ping Wang, Aimin Jiang, Xiaofeng Liu, Jing Shang, and Li Zhang. Lstm-based eeg classification in motor imagery tasks.IEEE transactions on neural systems and rehabilitation engineering, 26(11):2086–2095, 2018

2086
[38]

Eeg-based emotion recognition using regularized graph neural networks.IEEE Transactions on Affective Computing, 13(3):1290–1301, 2020

Peixiang Zhong, Di Wang, and Chunyan Miao. Eeg-based emotion recognition using regularized graph neural networks.IEEE Transactions on Affective Computing, 13(3):1290–1301, 2020

[39] Andac Demir, Toshiaki Koike-Akino, Ye Wang, Masaki Haruna, and Deniz Erdogmus. Eeg-gnn: Graph neural networks for classification of electroencephalogram (eeg) signals. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 1061–1067. IEEE, 2021.

[40] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[41] Shaked Brody, Uri Alon, and Eran Yahav. How attentive are graph attention networks? arXiv preprint arXiv:2105.14491, 2021.

[42] Hossein Adeli, Sun Minni, and Nikolaus Kriegeskorte. Predicting brain activity using transformers. bioRxiv, pages 2023–08, 2023.

[43] Roman Beliy, Navve Wasserman, Amit Zalcher, and Michal Irani. The wisdom of a crowd of brains: A universal brain encoder. arXiv preprint arXiv:2406.12179, 2024.

[44] Zhanqiang Guo, Jiamin Wu, Yonghao Song, Jiahui Bu, Weijian Mai, Qihao Zheng, Wanli Ouyang, and Chunfeng Song. Neuro-3d: Towards 3d visual decoding from eeg signals. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23870–23880, 2025.

[45] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.

[46] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019.

[47] Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(9):5149–5169, 2021.

[48] Steffen Schneider, Jin Hwa Lee, and Mackenzie Weygandt Mathis. Learnable latent embeddings for joint behavioural and neural analysis. Nature, 617(7960):360–368, 2023.

[49] Eric Y Wang, Paul G Fahey, Zhuokun Ding, Stelios Papadopoulos, Kayla Ponder, Marissa A Weis, Andersen Chang, Taliah Muhammad, Saumil Patel, Zhiwei Ding, et al. Foundation model of neural activity predicts response to new stimulus types. Nature, 640(8058):470–477, 2025.

[50] Marcel Binz, Elif Akata, Matthias Bethge, Franziska Brändle, Fred Callaway, Julian Coda-Forno, Peter Dayan, Can Demircan, Maria K Eckstein, Noémi Éltető, et al. A foundation model to predict and capture human cognition. Nature, 644(8078):1002–1009, 2025.

[51] Stéphane d’Ascoli, Jérémy Rapin, Yohann Benchetrit, Hubert Banville, and Jean-Rémi King. Tribe: Trimodal brain encoder for whole-brain fmri response prediction. arXiv preprint arXiv:2507.22229, 2025.

[52] Hsiang-Yun Sherry Chien, Hanlin Goh, Christopher M Sandino, and Joseph Y Cheng. Maeeg: Masked auto-encoder for eeg representation learning. arXiv preprint arXiv:2211.02625, 2022.

[53] Yunpeng Bai, Xintao Wang, Yan-Pei Cao, Yixiao Ge, Chun Yuan, and Ying Shan. Dreamdiffusion: High-quality eeg-to-image generation with temporal masked signal modeling and clip alignment. In European Conference on Computer Vision, pages 472–488. Springer, 2024.

[54] Jiquan Wang, Sha Zhao, Zhiling Luo, Yangxuan Zhou, Shijian Li, and Gang Pan. Eegmamba: An eeg foundation model with mamba. Neural Networks, page 107816, 2025.

[55] Zitao Fang, Chenxuan Li, Hongting Zhou, Shuyang Yu, Guodong Du, Ashwaq Qasem, Yang Lu, Jing Li, Junsong Zhang, and Sim Kuan Goh. Neuript: Foundation model for neural interfaces. arXiv preprint arXiv:2510.16548, 2025.

[56] Margitta Seeck, Laurent Koessler, Thomas Bast, Frans Leijten, Christoph Michel, Christoph Baumgartner, Bin He, and Sándor Beniczky. The standardized eeg electrode array of the ifcn. Clinical neurophysiology, 128(10):2070–2077, 2017.

[57] Martin N. Hebart, Oliver Contier, Lina Teichmann, Adam H. Rockter, Charles Y. Zheng, Alexis Kidder, Anna Corriveau, Maryam Vaziri-Pashkam, and Chris I. Baker. Things-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior. eLife, 12:e82580, 2023. doi: 10.7554/eLife.82580.

[58] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[59] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12104–12113, 2022.

[60] An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, and Chang Zhou. Chinese clip: Contrastive vision-language pretraining in chinese. arXiv preprint arXiv:2211.01335, 2022.

[61] N. Kriegeskorte, M. Mur, and P. Bandettini. Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4, 2008.

[62] Radoslaw M Cichy and Aude Oliva. A m/eeg-fmri fusion primer: resolving human brain responses in space and time. Neuron, 107(5):772–781, 2020.

[63] Radoslaw Martin Cichy, Dimitrios Pantazis, and Aude Oliva. Resolving human object recognition in space and time. Nature neuroscience, 17(3):455–462, 2014.

[64] Thalía Harmony. The functional significance of delta oscillations in cognitive processing. Frontiers in integrative neuroscience, 7:83, 2013.

[65] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

A Dataset

We evaluate our method on two large-scale benchmarks: Things-EEG2 and Things-MEG. Table 7 provides the detailed information on the two datasets. Things-EEG2 provides 63-channel EEG recordings from 10 participants viewing natural obje...

and α=0.1. From the 16,540 training trials, 740 are held out for validation, fixed across runs and seeds. Final predictions average the three checkpoints with the lowest validation loss; all experiments are repeated over 3 seeds.

Statistical testing. We assess significance with paired Wilcoxon signed-rank tests over the 10 per-subject scores (two-sided, α=...
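The hold-out split and checkpoint averaging described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the seed value, and the nested-list score layout are hypothetical, and it assumes that "averaging the three checkpoints" means averaging their per-trial score matrices (averaging model weights would be another possible reading).

```python
import random


def split_train_val(n_trials=16540, n_val=740, seed=0):
    """Hold out a fixed validation subset of trial indices.

    The seed is an assumption; the text only says the split is fixed
    across runs and seeds.
    """
    rng = random.Random(seed)
    indices = list(range(n_trials))
    rng.shuffle(indices)
    val = sorted(indices[:n_val])
    train = sorted(indices[n_val:])
    return train, val


def average_best_checkpoints(val_losses, scores_per_checkpoint, k=3):
    """Average the score matrices of the k checkpoints with the lowest validation loss.

    scores_per_checkpoint[c][r][j] is the score that checkpoint c assigns
    to candidate j for trial r (hypothetical layout).
    """
    best = sorted(range(len(val_losses)), key=lambda i: val_losses[i])[:k]
    n_rows = len(scores_per_checkpoint[0])
    n_cols = len(scores_per_checkpoint[0][0])
    averaged = [
        [
            sum(scores_per_checkpoint[c][r][j] for c in best) / len(best)
            for j in range(n_cols)
        ]
        for r in range(n_rows)
    ]
    return best, averaged
```

With these numbers, the split leaves 15,800 trials for training. Reusing the same seed for the split keeps the validation set identical while the model seed varies, which matches the "fixed across runs and seeds" description.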

[1] Mackenzie Weygandt Mathis, Adriana Perez Rotondo, Edward F Chang, Andreas S Tolias, and Alexander Mathis. Decoding the brain: From neural representations to mechanistic models. Cell, 187(21):5814–5832, 2024.

[2] Daniel LK Yamins and James J DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature neuroscience, 19(3):356–365, 2016.

[3] Yukiyasu Kamitani and Frank Tong. Decoding the visual and subjective contents of the human brain. Nature neuroscience, 8(5):679–685, 2005.

[4] Kendrick N Kay, Thomas Naselaris, Ryan J Prenger, and Jack L Gallant. Identifying natural images from human brain activity. Nature, 452(7185):352–355, 2008.

[5] Yoichi Miyawaki, Hajime Uchida, Okito Yamashita, Masa-aki Sato, Yusuke Morito, Hiroki C Tanabe, Norihiro Sadato, and Yukiyasu Kamitani. Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron, 60(5):915–929, 2008.

[6] Daniel LK Yamins, Ha Hong, Charles F Cadieu, Ethan A Solomon, Darren Seibert, and James J DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the national academy of sciences, 111(23):8619–8624, 2014.

[7] Nikolaus Kriegeskorte. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annual review of vision science, 1:417–446, 2015.

[8] Charlotte Caucheteux and Jean-Rémi King. Brains and algorithms partially converge in natural language processing. Communications biology, 5(1):134, 2022.

[9] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[10] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024.

[11] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[12] Alessandro T Gifford, Kshitij Dwivedi, Gemma Roig, and Radoslaw M Cichy. A large and rich eeg dataset for modeling human visual object recognition. NeuroImage, 264:119754, 2022.

[13] Tijl Grootswagers, Ivy Zhou, Amanda K Robinson, Martin N Hebart, and Thomas A Carlson. Human eeg recordings for 1,854 concepts presented in rapid serial visual presentation streams. Scientific Data, 9(1):3, 2022.

[14] Y. Song et al. Decoding natural images from eeg for object recognition. arXiv preprint arXiv:2308.13234, 2023.

[15] Yonghao Song, Yijun Wang, Huiguang He, and Xiaorong Gao. Recognizing natural images from eeg with language-guided contrastive learning. IEEE Transactions on Neural Networks and Learning Systems, 2025.

[16] Dongyang Li, Chen Wei, Shiying Li, Jiachen Zou, Haoyang Qin, and Quanying Liu. Visual decoding and reconstruction via eeg embeddings with guided diffusion. arXiv preprint arXiv:2403.07721, 2024.

[17] Haitao Wu, Qing Li, Changqing Zhang, Zhen He, and Xiaomin Ying. Bridging the vision-brain gap with an uncertainty-aware blur prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2246–2257, 2025.

[18] Yueyang Li, Zijian Kang, Shengyu Gong, Wenhao Dong, Weiming Zeng, Hongjie Yan, Wai Ting Siok, and Nizhuan Wang. Neural-mcrl: Neural multimodal contrastive representation learning for eeg-based visual decoding. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025.

[19] Felix Darvas, D Pantazis, E Kucukaltun-Yildirim, and RM Leahy. Mapping human brain function with meg and eeg: methods and validation. NeuroImage, 23:S289–S299, 2004.

[20] Hafeez Ullah Amin, Wajid Mumtaz, Ahmad Rauf Subhani, Mohamad Naufal Mohamad Saad, and Aamir Saeed Malik. Classification of eeg signals based on pattern recognition approach. Frontiers in computational neuroscience, 11:103, 2017.

[21] Aina Puce and Matti S Hämäläinen. A review of issues related to data acquisition and analysis in eeg/meg studies. Brain sciences, 7(6):58, 2017.

[22] James V Haxby, J Swaroop Guntupalli, Andrew C Connolly, Yaroslav O Halchenko, Bryan R Conroy, M Ida Gobbini, Michael Hanke, and Peter J Ramadge. A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron, 72(2):404–416, 2011.

[23] Changde Du, Kaicheng Fu, Jinpeng Li, and Huiguang He. Decoding visual neural representations by multimodal learning of brain-visual-linguistic features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10760–10777, 2023.

[24] Yassine El Ouahidi, Jonathan Lys, Philipp Thölke, Nicolas Farrugia, Bastien Pasdeloup, Vincent Gripon, Karim Jerbi, and Giulia Lioi. Reve: A foundation model for eeg–adapting to any setup with large-scale pretraining on 25,000 subjects. arXiv preprint arXiv:2510.21585, 2025.

[25] Wenchao Yang, Weidong Yan, Wenkang Liu, Yulan Ma, and Yang Li. Thd-bar: Topology hierarchical derived brain autoregressive modeling for eeg generic representations. arXiv preprint arXiv:2511.13733, 2025.

[26] Yangxuan Zhou, Sha Zhao, Jiquan Wang, Haiteng Jiang, Shijian Li, Tao Li, and Gang Pan. Spiced: A synaptic homeostasis-inspired framework for unsupervised continual eeg decoding. arXiv preprint arXiv:2509.17439, 2025.

[27] Thomas Naselaris, Kendrick N Kay, Shinji Nishimoto, and Jack L Gallant. Encoding and decoding in fmri. Neuroimage, 56(2):400–410, 2011.

[28] Alexander G Huth, Wendy A De Heer, Thomas L Griffiths, Frédéric E Theunissen, and Jack L Gallant. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600):453–458, 2016.

[29] Martin Schrimpf, Jonas Kubilius, Ha Hong, Najib J Majaj, Rishi Rajalingham, Elias B Issa, Kohitij Kar, Pouya Bashivan, Jonathan Prescott-Roy, Franziska Geiger, et al. Brain-score: Which artificial neural network for object recognition is most brain-like? BioRxiv, page 407007, 2018.

[30] Concetto Spampinato, Simone Palazzo, Isaak Kavasidis, Daniela Giordano, Nasim Souly, and Mubarak Shah. Deep learning human mind for automated visual classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6809–6817, 2017.

[31] Yohann Benchetrit, Hubert Banville, and Jean-Rémi King. Brain decoding: toward real-time reconstruction of visual perception. arXiv preprint arXiv:2310.19812, 2023.

[32] Paul S Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A Norman, et al. Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data. arXiv preprint arXiv:2403.11207, 2024.
