Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs
Pith reviewed 2026-05-20 10:20 UTC · model grok-4.3
The pith
Generating proxy images from EEG signals lets MLLMs use visual priors to interpret brain activity more effectively than text alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generative Visual Grounding employs an EEG-to-image generative model as a visual translator to produce instance-specific proxy images for non-visual EEG. These proxies supply structured visual contexts that allow MLLMs to exploit their visual priors for clinical-state interpretation, delivering competitive results with image-only alignment and consistent improvements when extended to trimodal image-plus-text alignment.
What carries the argument
Generative Visual Grounding (GVG), the framework that uses an EEG-to-image generative model to create instance-specific proxy images serving as visual contexts for MLLM alignment.
If this is right
- Image-only alignment using the generated proxies matches the performance of larger text-aligned baselines while tuning only a small fraction of parameters on a frozen backbone.
- Trimodal alignment that adds the visual proxies to text supplies both categorical semantic anchors and perceptual details for richer neural representations.
- The method produces measurable gains in EEG understanding tasks as well as in visual generation from brain signals.
- Visual proxy grounding functions as a direct complement to textual alignment for building more capable EEG foundation models.
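For a rough sense of scale on the parameter-efficiency point, the three figures quoted in the abstract (170M tuned parameters, a frozen 7B backbone, 1.7B-parameter text-aligned baselines) can be turned into ratios. The sketch below is plain arithmetic on those numbers. The ratios are derived here rather than reported by the paper, and the variable names are illustrative.

```python
# Figures quoted in the abstract. The ratios below are derived for illustration
# and are not values reported by the paper.
TUNED_PARAMS = 170e6     # parameters tuned in GVG-X-Omni
FROZEN_BACKBONE = 7e9    # size of the frozen MLLM backbone
BASELINE_PARAMS = 1.7e9  # size of the text-aligned baselines


def fraction(part: float, whole: float) -> float:
    """Return part / whole as a plain ratio."""
    return part / whole


tuned_vs_backbone = fraction(TUNED_PARAMS, FROZEN_BACKBONE)
tuned_vs_baseline = fraction(TUNED_PARAMS, BASELINE_PARAMS)

print(f"tuned / frozen backbone: {tuned_vs_backbone:.1%}")
print(f"tuned / baseline size:   {tuned_vs_baseline:.1%}")
```

On these numbers the tuned parameters come to roughly 2.4% of the backbone size and about a tenth of the baseline size. Whether the baselines train all 1.7B of their parameters is not stated in this summary.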
Where Pith is reading between the lines
- Similar proxy generation could extend visual grounding to other non-visual sensor data such as audio or wearable signals.
- The approach may support more interpretable brain-computer interfaces by linking raw neural activity to concrete visual outputs users can inspect.
- Testing whether the generated images recover specific perceptual experiences encoded in EEG would provide a direct check on information preservation.
- Combining this grounding with other modalities could produce more robust multimodal models for scarce brain-signal datasets.
Load-bearing premise
EEG-to-image generative models can accurately translate neural signals into meaningful visual representations that preserve fine-grained perceptual information without introducing misleading artifacts.
What would settle it
A controlled experiment showing that MLLMs achieve equal or lower accuracy on clinical-state prediction tasks when given the generated proxy images versus text-only alignments would falsify the central claim.
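One way such a comparison could be scored, assuming both alignments are evaluated on the same test recordings, is an exact one-sided sign test on the cases where the two models disagree. The sketch below is illustrative only. The function name, the toy correctness vectors, and the choice of a sign test are assumptions made here, not the paper's protocol.

```python
from math import comb
from typing import Sequence, Tuple


def paired_sign_test(correct_a: Sequence[int], correct_b: Sequence[int]) -> Tuple[int, int, float]:
    """Exact one-sided sign test on discordant pairs.

    correct_a[i] and correct_b[i] are 1 if model A / model B classified
    test item i correctly, else 0. Returns (a_only, b_only, p_value), where
    p_value is the probability of seeing at least a_only wins for A among the
    discordant pairs if both models were equally likely to win each one.
    """
    a_only = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    b_only = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # the models never disagree
    p_value = sum(comb(n, k) for k in range(a_only, n + 1)) / 2 ** n
    return a_only, b_only, p_value


# Hypothetical per-item correctness, NOT results from the paper.
proxy_image_correct = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
text_only_correct = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]

result = paired_sign_test(proxy_image_correct, text_only_correct)
print(result)
```

A large p-value, or more discordant wins for the text-only model, would point in the direction the falsification criterion describes.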
Original abstract
Leveraging the universal representations of pre-trained LLMs and MLLMs offers a promising path toward brain foundation models. However, visually-evoked EEG datasets remain scarce, leading existing methods to align neural signals mainly with abstract text, a lossy translation that may discard fine-grained perceptual information encoded in brain activity. We propose Generative Visual Grounding (GVG), a framework that visualizes the invisible by using an EEG-to-image generative model as a visual translator. Instead of forcing EEG into text alone, GVG hallucinates instance-specific proxy images for non-visual EEG, providing structured visual contexts that allow MLLMs to exploit their visual priors for clinical-state interpretation. We validate this idea on two MLLM backbones, GVG-X-Omni and GVG-Janus. Image-only alignment is already competitive: the lightweight GVG-X-Omni matches 1.7B-parameter text-aligned baselines while tuning only 170M parameters on a frozen 7B backbone. We further extend GVG-Janus with trimodal Image+Text alignment, where text supplies categorical semantic anchors and visual proxies enrich neural representations with perceptual details. Experiments show consistent gains in EEG understanding and visual generation, suggesting visual proxy grounding as an effective complement to textual alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Generative Visual Grounding (GVG), a framework that uses an EEG-to-image generative model to hallucinate instance-specific proxy images from non-visual EEG signals. These proxies supply structured visual context to MLLMs, enabling them to leverage visual priors for clinical-state interpretation instead of relying solely on lossy text alignment. The approach is validated on two backbones (GVG-X-Omni and GVG-Janus), with claims that image-only alignment is competitive with larger text baselines using only 170M tunable parameters on a frozen 7B model, and that trimodal (Image+Text) alignment yields further gains in EEG understanding and visual generation.
Significance. If the generated visual proxies faithfully encode fine-grained perceptual details from EEG without introducing artifacts, the framework could meaningfully advance brain foundation models by complementing textual alignment with visual priors in MLLMs. The parameter-efficient tuning (170M parameters) and the explicit separation of categorical semantic anchors (text) from perceptual enrichment (images) are strengths. However, the absence of direct fidelity metrics or controls for non-visual EEG cases limits the assessed impact, as gains might stem from added modality capacity rather than meaningful neural-to-visual translation.
major comments (3)
- [Abstract / Experiments] Abstract and validation sections: The central claim that visual proxies 'enrich neural representations with perceptual details' and enable 'consistent gains' requires evidence that EEG-to-image outputs preserve fine-grained information rather than spurious features. No direct fidelity checks, image quality metrics, or comparisons against ground-truth perceptual content for non-visual EEG are described, leaving open whether reported improvements track proxy quality or simply reflect extra input capacity.
- [Validation on GVG-X-Omni] GVG-X-Omni description: The claim that the lightweight model 'matches 1.7B-parameter text-aligned baselines' while tuning only 170M parameters on a frozen 7B backbone is load-bearing for the efficiency argument, yet no specific baseline models, datasets, tasks, or numerical performance values (e.g., accuracy, F1) are provided to support the comparison.
- [GVG-Janus trimodal alignment] Trimodal extension: Extending GVG-Janus with Image+Text alignment is presented as yielding further gains, but without ablation isolating the contribution of the generated visual proxies versus text alone, or versus random visual inputs, it is unclear whether the perceptual enrichment is the operative factor.
minor comments (2)
- [Abstract] The abstract uses 'hallucinates' to describe the generative process; a more neutral term such as 'generates' would avoid unintended connotations in a scientific context.
- [Methods] Notation for the two backbones (GVG-X-Omni, GVG-Janus) is introduced without an explicit definition of how GVG is integrated into each architecture.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, clarifying our approach and outlining revisions to strengthen the evidence and presentation.
Referee: [Abstract / Experiments] Abstract and validation sections: The central claim that visual proxies 'enrich neural representations with perceptual details' and enable 'consistent gains' requires evidence that EEG-to-image outputs preserve fine-grained information rather than spurious features. No direct fidelity checks, image quality metrics, or comparisons against ground-truth perceptual content for non-visual EEG are described, leaving open whether reported improvements track proxy quality or simply reflect extra input capacity.
Authors: We agree that direct fidelity evidence would be ideal. However, non-visual EEG inherently lacks ground-truth images, rendering standard metrics such as FID or LPIPS inapplicable without artificial references. Our primary validation relies on consistent downstream gains in EEG understanding and generation tasks, which serve as indirect but task-relevant indicators that the proxies capture meaningful perceptual structure rather than noise. In revision we will add a dedicated subsection discussing evaluation challenges for non-visual signals, include qualitative examples of generated proxies with corresponding model attention maps, and report correlation analysis between proxy characteristics and task performance to better address this concern. revision: yes
Referee: [Validation on GVG-X-Omni] GVG-X-Omni description: The claim that the lightweight model 'matches 1.7B-parameter text-aligned baselines' while tuning only 170M parameters on a frozen 7B backbone is load-bearing for the efficiency argument, yet no specific baseline models, datasets, tasks, or numerical performance values (e.g., accuracy, F1) are provided to support the comparison.
Authors: The experimental section of the full manuscript contains these comparisons, but we acknowledge that the high-level claim in the abstract and introduction would benefit from explicit anchoring. In the revised manuscript we will insert a concise table or paragraph that names the specific 1.7B-parameter text-aligned baselines, lists the EEG datasets and clinical interpretation tasks used, and reports the numerical results (accuracy and F1 scores) demonstrating that GVG-X-Omni remains competitive while tuning only 170M parameters on the frozen 7B backbone. revision: yes
Referee: [GVG-Janus trimodal alignment] Trimodal extension: Extending GVG-Janus with Image+Text alignment is presented as yielding further gains, but without ablation isolating the contribution of the generated visual proxies versus text alone, or versus random visual inputs, it is unclear whether the perceptual enrichment is the operative factor.
Authors: We have already compared text-only, image-only, and trimodal alignments and observed incremental gains for the trimodal setting. To more rigorously isolate the role of the generated proxies, we will add a new ablation experiment in the revision that replaces the EEG-conditioned proxies with random or noise-based images while keeping all other factors fixed. This control will clarify whether the observed improvements stem from semantically relevant visual content rather than simply the addition of an extra modality. revision: yes
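The control described in the last response could be organized along the following lines. This is a minimal sketch and not the authors' code: `build_controls`, `run_ablation`, the flattened-list image type, and the toy `toy_evaluate` function are all hypothetical, and the shuffled-proxy condition is an extra control added here for illustration.

```python
import random
from typing import Callable, Dict, List, Sequence

Image = List[float]  # flattened pixel values, a stand-in for a real image tensor


def build_controls(proxies: Sequence[Image], seed: int = 0) -> Dict[str, List[Image]]:
    """Return the EEG-conditioned proxies plus two control sets of the same shape."""
    rng = random.Random(seed)
    shuffled = list(proxies)
    rng.shuffle(shuffled)  # real proxies, but possibly paired with the wrong EEG instance
    noise = [[rng.random() for _ in img] for img in proxies]  # uniform noise images
    return {"eeg_conditioned": list(proxies), "shuffled": shuffled, "noise": noise}


def run_ablation(
    evaluate: Callable[[Sequence[Image], Sequence[int]], float],
    proxies: Sequence[Image],
    labels: Sequence[int],
    seed: int = 0,
) -> Dict[str, float]:
    """Score every condition with the same evaluation function and labels."""
    controls = build_controls(proxies, seed)
    return {name: evaluate(images, labels) for name, images in controls.items()}


def toy_evaluate(images: Sequence[Image], labels: Sequence[int]) -> float:
    """Toy stand-in for a model: predict 1 when mean pixel intensity exceeds 0.5."""
    preds = [1 if sum(img) / len(img) > 0.5 else 0 for img in images]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical data, chosen so the matched proxies are perfectly informative.
proxies = [[0.9, 0.8], [0.1, 0.2], [0.7, 0.9], [0.2, 0.3]]
labels = [1, 0, 1, 0]

scores = run_ablation(toy_evaluate, proxies, labels)
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

If the EEG-conditioned condition does not clearly beat the shuffled and noise conditions in a real run, the gains would be better attributed to the extra modality than to the content of the proxies.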
Circularity Check
No circularity: new framework proposal validated via independent experiments
The paper proposes Generative Visual Grounding (GVG) as a method that uses an EEG-to-image generative model to create instance-specific visual proxies for non-visual EEG signals, which are then fed into MLLMs for improved clinical-state interpretation. The derivation consists of describing this translator role, applying it to two specific backbones (GVG-X-Omni with 170M tunable parameters on a frozen 7B model, and trimodal GVG-Janus), and reporting empirical gains in alignment and generation tasks. No equations, fitted parameters renamed as predictions, or self-citation chains are invoked to force the central claims; the results are presented as outcomes of external validation on GVG-X-Omni and GVG-Janus rather than reducing tautologically to the inputs by construction. The approach remains self-contained against the described benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: EEG signals contain fine-grained perceptual information that can be translated into instance-specific visual images via generative models
invented entities (1)
- Generative Visual Grounding (GVG) framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We employ an EEG-to-Image generative model (AVDE) as a visual translator to hallucinate instance-specific proxy images for non-visual EEG data... trimodal objective $L_{\mathrm{tri}} = \lambda_{ei} L_{ei} + \lambda_{et} L_{et} + \lambda_{it} L_{it}$
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: mapping raw EEG signals into discrete image tokens... similarity-based prediction over codebook V
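The two quoted passages name a weighted trimodal objective and a similarity-based prediction over a codebook. The sketch below shows one plausible reading of each, under stated assumptions: the subscripts `ei`, `et`, and `it` are taken to mean the EEG-image, EEG-text, and image-text pairs; cosine similarity is assumed as the similarity measure; and the function names, loss values, weights, and codebook entries are all hypothetical.

```python
import math
from typing import Dict, Sequence

PAIRS = ("ei", "et", "it")  # assumed reading: EEG-image, EEG-text, image-text


def trimodal_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum: L_tri = lambda_ei*L_ei + lambda_et*L_et + lambda_it*L_it."""
    return sum(weights[p] * losses[p] for p in PAIRS)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_token(feature: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry most similar to the feature."""
    sims = [cosine(feature, entry) for entry in codebook]
    return max(range(len(codebook)), key=lambda i: sims[i])


# Hypothetical per-pair loss values and weights, not taken from the paper.
losses = {"ei": 0.5, "et": 1.0, "it": 0.25}
weights = {"ei": 1.0, "et": 0.5, "it": 2.0}
total = trimodal_loss(losses, weights)
print(total)

# Hypothetical 2-D codebook with three entries and one EEG-derived feature.
codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token = predict_token([0.9, 0.1], codebook)
print(token)
```

In the toy example the feature lies closest in angle to the first codebook entry, so index 0 is selected. A real system would predict one token per image position and use learned embeddings.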
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.