arxiv: 2605.18172 · v1 · pith:H3T5GARMnew · submitted 2026-05-18 · 💻 cs.AI

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

Pith reviewed 2026-05-20 10:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords generative visual grounding · EEG understanding · multimodal large language models · proxy images · visual alignment · brain signals · clinical interpretation · neural representations

The pith

Generating proxy images from EEG signals lets MLLMs use visual priors to interpret brain activity more effectively than text alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Generative Visual Grounding to overcome limited visually-evoked EEG data by turning neural signals into instance-specific images. Rather than mapping brain activity only to abstract text, which risks losing perceptual details, the framework uses an EEG-to-image model to create visual proxies. These images supply structured contexts that let multimodal large language models draw on their existing visual knowledge for clinical interpretation tasks. Tests on two backbones show image-only alignment already competes with larger text-based systems while tuning far fewer parameters, and combining images with text yields further gains in understanding and generation. If correct, the work points toward brain foundation models that retain richer information from raw neural signals.

Core claim

Generative Visual Grounding employs an EEG-to-image generative model as a visual translator to produce instance-specific proxy images for non-visual EEG. These proxies supply structured visual contexts that allow MLLMs to exploit their visual priors for clinical-state interpretation, delivering competitive results with image-only alignment and consistent improvements when extended to trimodal image-plus-text alignment.

What carries the argument

Generative Visual Grounding (GVG), the framework that uses an EEG-to-image generative model to create instance-specific proxy images serving as visual contexts for MLLM alignment.
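
The data flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `ProxyGenerator`, `VisionLanguageModel`, `interpret_eeg`, and both stub classes are hypothetical names, the stubs stand in for a real EEG-to-image model and a real MLLM, and the output labels are toy values.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence

EEG = Sequence[Sequence[float]]   # channels x time samples
Image = List[List[float]]         # toy grayscale image, rows x cols


class ProxyGenerator(Protocol):
    def generate(self, eeg: EEG) -> Image: ...


class VisionLanguageModel(Protocol):
    def answer(self, image: Image, prompt: str) -> str: ...


@dataclass
class StubGenerator:
    """Stand-in for an EEG-to-image model: one pixel row per channel."""
    width: int = 4

    def generate(self, eeg: EEG) -> Image:
        # Each row repeats the channel mean, so the image depends on the input.
        return [[sum(ch) / len(ch)] * self.width for ch in eeg]


class StubMLLM:
    """Stand-in for a frozen MLLM that reads the proxy image."""

    def answer(self, image: Image, prompt: str) -> str:
        brightness = sum(map(sum, image)) / sum(len(row) for row in image)
        return "high-amplitude" if brightness > 0.5 else "low-amplitude"


def interpret_eeg(eeg: EEG, gen: ProxyGenerator, mllm: VisionLanguageModel,
                  prompt: str = "Describe the clinical state.") -> str:
    proxy = gen.generate(eeg)          # step 1: EEG -> instance-specific proxy image
    return mllm.answer(proxy, prompt)  # step 2: the MLLM reads the proxy as visual context


if __name__ == "__main__":
    print(interpret_eeg([[0.9, 0.8], [0.7, 0.6]], StubGenerator(), StubMLLM()))
```

The two `Protocol` classes keep the generator and the MLLM swappable, which mirrors the paper's use of two different backbones.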

If this is right

  • Image-only alignment using the generated proxies matches the performance of larger text-aligned baselines while tuning only a small fraction of parameters on a frozen backbone.
  • Trimodal alignment that adds the visual proxies to text supplies both categorical semantic anchors and perceptual details for richer neural representations.
  • The method produces measurable gains in EEG understanding tasks as well as in visual generation from brain signals.
  • Visual proxy grounding functions as a direct complement to textual alignment for building more capable EEG foundation models.
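
The efficiency claim in the first bullet can be made concrete with the figures quoted in the abstract (170M tuned parameters, a frozen 7B backbone, 1.7B-parameter baselines). The quick check below assumes these are plain parameter counts; the abstract does not say whether 1.7B is the baselines' total or tuned size. The helper name `fraction` is illustrative.

```python
def fraction(part: float, whole: float) -> float:
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


TUNED = 170e6       # parameters tuned in GVG-X-Omni (from the abstract)
BACKBONE = 7e9      # frozen backbone size (from the abstract)
BASELINE = 1.7e9    # size of the text-aligned baselines (from the abstract)

tuned_vs_backbone = fraction(TUNED, BACKBONE)   # about 2.4 percent
tuned_vs_baseline = fraction(TUNED, BASELINE)   # 10 percent

print(f"{tuned_vs_backbone:.1f}% of the backbone, "
      f"{tuned_vs_baseline:.1f}% of the baseline size")
```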

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar proxy generation could extend visual grounding to other non-visual sensor data such as audio or wearable signals.
  • The approach may support more interpretable brain-computer interfaces by linking raw neural activity to concrete visual outputs users can inspect.
  • Testing whether the generated images recover specific perceptual experiences encoded in EEG would provide a direct check on information preservation.
  • Combining this grounding with other modalities could produce more robust multimodal models for scarce brain-signal datasets.

Load-bearing premise

EEG-to-image generative models can accurately translate neural signals into meaningful visual representations that preserve fine-grained perceptual information without introducing misleading artifacts.

What would settle it

A controlled experiment showing that MLLMs achieve equal or lower accuracy on clinical-state prediction tasks when given the generated proxy images versus text-only alignments would falsify the central claim.
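
One way to run such a comparison is a paired test on per-sample correctness, since both conditions would be evaluated on the same EEG recordings. The sketch below uses an exact McNemar test, a standard choice for paired binary outcomes and not something the paper specifies. The function name and the toy data are illustrative.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-sample correctness."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both conditions must cover the same samples")
    # Discordant pairs: samples where exactly one condition is correct.
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # the conditions never disagree
    k = min(only_a, only_b)
    # Under the null, discordant pairs split 50/50 (binomial with p = 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data (made up): proxy-image condition vs. text-only condition.
proxy_ok = [True] * 8 + [False] * 2
text_ok = [True] * 2 + [False] * 6 + [True] * 2
p_value = mcnemar_exact(proxy_ok, text_ok)
print(f"p = {p_value:.4f}")
```

The test only indicates whether the two conditions differ; the direction of the difference comes from comparing the two discordant counts.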

Figures

Figures reproduced from arXiv: 2605.18172 by Baoliang Lu, Dongsheng Li, Enze Zhang, Junyu Pan, Weilong Zheng, Yansen Wang.

Figure 1: Overview of our core idea and proxy-image strategy. Left: GVG converts EEG into a visual-like language, allowing […]
Figure 2: Overview of the Generative Visual Grounding (GVG) Training Framework. The proposed GVG pipeline consists of […]
Figure 3: Qualitative Results of EEG-based Visual Reconstruction. We visualize the decoding capabilities of our two instantiations.

Original abstract

Leveraging the universal representations of pre-trained LLMs and MLLMs offers a promising path toward brain foundation models. However, visually-evoked EEG datasets remain scarce, leading existing methods to align neural signals mainly with abstract text, a lossy translation that may discard fine-grained perceptual information encoded in brain activity. We propose Generative Visual Grounding (GVG), a framework that visualizes the invisible by using an EEG-to-image generative model as a visual translator. Instead of forcing EEG into text alone, GVG hallucinates instance-specific proxy images for non-visual EEG, providing structured visual contexts that allow MLLMs to exploit their visual priors for clinical-state interpretation. We validate this idea on two MLLM backbones, GVG-X-Omni and GVG-Janus. Image-only alignment is already competitive: the lightweight GVG-X-Omni matches 1.7B-parameter text-aligned baselines while tuning only 170M parameters on a frozen 7B backbone. We further extend GVG-Janus with trimodal Image+Text alignment, where text supplies categorical semantic anchors and visual proxies enrich neural representations with perceptual details. Experiments show consistent gains in EEG understanding and visual generation, suggesting visual proxy grounding as an effective complement to textual alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Generative Visual Grounding (GVG), a framework that uses an EEG-to-image generative model to hallucinate instance-specific proxy images from non-visual EEG signals. These proxies supply structured visual context to MLLMs, enabling them to leverage visual priors for clinical-state interpretation instead of relying solely on lossy text alignment. The approach is validated on two backbones (GVG-X-Omni and GVG-Janus), with claims that image-only alignment is competitive with larger text baselines using only 170M tunable parameters on a frozen 7B model, and that trimodal (Image+Text) alignment yields further gains in EEG understanding and visual generation.

Significance. If the generated visual proxies faithfully encode fine-grained perceptual details from EEG without introducing artifacts, the framework could meaningfully advance brain foundation models by complementing textual alignment with visual priors in MLLMs. The parameter-efficient tuning (170M parameters) and the explicit separation of categorical semantic anchors (text) from perceptual enrichment (images) are strengths. However, the absence of direct fidelity metrics or controls for non-visual EEG cases limits the assessed impact, as gains might stem from added modality capacity rather than meaningful neural-to-visual translation.

major comments (3)
  1. [Abstract / Experiments] Abstract and validation sections: The central claim that visual proxies 'enrich neural representations with perceptual details' and enable 'consistent gains' requires evidence that EEG-to-image outputs preserve fine-grained information rather than spurious features. No direct fidelity checks, image quality metrics, or comparisons against ground-truth perceptual content for non-visual EEG are described, leaving open whether reported improvements track proxy quality or simply reflect extra input capacity.
  2. [Validation on GVG-X-Omni] GVG-X-Omni description: The claim that the lightweight model 'matches 1.7B-parameter text-aligned baselines' while tuning only 170M parameters on a frozen 7B backbone is load-bearing for the efficiency argument, yet no specific baseline models, datasets, tasks, or numerical performance values (e.g., accuracy, F1) are provided to support the comparison.
  3. [GVG-Janus trimodal alignment] Trimodal extension: Extending GVG-Janus with Image+Text alignment is presented as yielding further gains, but without ablation isolating the contribution of the generated visual proxies versus text alone, or versus random visual inputs, it is unclear whether the perceptual enrichment is the operative factor.
minor comments (2)
  1. [Abstract] The abstract uses 'hallucinates' to describe the generative process; a more neutral term such as 'generates' would avoid unintended connotations in a scientific context.
  2. [Methods] Notation for the two backbones (GVG-X-Omni, GVG-Janus) is introduced without an explicit definition of how GVG is integrated into each architecture.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, clarifying our approach and outlining revisions to strengthen the evidence and presentation.

  1. Referee: [Abstract / Experiments] Abstract and validation sections: The central claim that visual proxies 'enrich neural representations with perceptual details' and enable 'consistent gains' requires evidence that EEG-to-image outputs preserve fine-grained information rather than spurious features. No direct fidelity checks, image quality metrics, or comparisons against ground-truth perceptual content for non-visual EEG are described, leaving open whether reported improvements track proxy quality or simply reflect extra input capacity.

    Authors: We agree that direct fidelity evidence would be ideal. However, non-visual EEG inherently lacks ground-truth images, rendering standard metrics such as FID or LPIPS inapplicable without artificial references. Our primary validation relies on consistent downstream gains in EEG understanding and generation tasks, which serve as indirect but task-relevant indicators that the proxies capture meaningful perceptual structure rather than noise. In revision we will add a dedicated subsection discussing evaluation challenges for non-visual signals, include qualitative examples of generated proxies with corresponding model attention maps, and report correlation analysis between proxy characteristics and task performance to better address this concern.
    revision: yes

  2. Referee: [Validation on GVG-X-Omni] GVG-X-Omni description: The claim that the lightweight model 'matches 1.7B-parameter text-aligned baselines' while tuning only 170M parameters on a frozen 7B backbone is load-bearing for the efficiency argument, yet no specific baseline models, datasets, tasks, or numerical performance values (e.g., accuracy, F1) are provided to support the comparison.

    Authors: The experimental section of the full manuscript contains these comparisons, but we acknowledge that the high-level claim in the abstract and introduction would benefit from explicit anchoring. In the revised manuscript we will insert a concise table or paragraph that names the specific 1.7B-parameter text-aligned baselines, lists the EEG datasets and clinical interpretation tasks used, and reports the numerical results (accuracy and F1 scores) demonstrating that GVG-X-Omni remains competitive while tuning only 170M parameters on the frozen 7B backbone.
    revision: yes

  3. Referee: [GVG-Janus trimodal alignment] Trimodal extension: Extending GVG-Janus with Image+Text alignment is presented as yielding further gains, but without ablation isolating the contribution of the generated visual proxies versus text alone, or versus random visual inputs, it is unclear whether the perceptual enrichment is the operative factor.

    Authors: We have already compared text-only, image-only, and trimodal alignments and observed incremental gains for the trimodal setting. To more rigorously isolate the role of the generated proxies, we will add a new ablation experiment in the revision that replaces the EEG-conditioned proxies with random or noise-based images while keeping all other factors fixed. This control will clarify whether the observed improvements stem from semantically relevant visual content rather than simply the addition of an extra modality.
    revision: yes
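
The control proposed in the third response could be set up roughly as follows. This is a sketch under assumptions, not the authors' protocol: images are toy nested lists, and `make_control_proxies` with its `mode` values is a hypothetical helper. The `shuffle` mode reads "random" as reassigning real proxies to the wrong recordings, which keeps image statistics while breaking the EEG-to-image pairing.

```python
import random
from typing import List

Image = List[List[float]]


def make_control_proxies(proxies: List[Image], mode: str, seed: int = 0) -> List[Image]:
    """Build control images that carry no information about the paired EEG."""
    rng = random.Random(seed)
    if mode == "noise":
        # Same shape as each proxy, pixels drawn uniformly from [0, 1).
        return [[[rng.random() for _ in row] for row in img] for img in proxies]
    if mode == "shuffle":
        # Rotate by a random non-zero offset so no index keeps its own proxy.
        if len(proxies) < 2:
            raise ValueError("need at least two proxies to shuffle")
        offset = rng.randrange(1, len(proxies))
        return proxies[offset:] + proxies[:offset]
    raise ValueError(f"unknown mode: {mode}")


proxies = [[[float(i)] * 2] * 2 for i in range(4)]   # four toy 2x2 images
noise = make_control_proxies(proxies, "noise")
shuffled = make_control_proxies(proxies, "shuffle")
```

Fixing the seed keeps the control set identical across runs, so any change in downstream accuracy can be attributed to the swap rather than to sampling variation.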

Circularity Check

0 steps flagged

No circularity: new framework proposal validated via independent experiments

The paper proposes Generative Visual Grounding (GVG) as a method that uses an EEG-to-image generative model to create instance-specific visual proxies for non-visual EEG signals, which are then fed into MLLMs for improved clinical-state interpretation. The derivation consists of describing this translator role, applying it to two specific backbones (GVG-X-Omni with 170M tunable parameters on a frozen 7B model, and trimodal GVG-Janus), and reporting empirical gains in alignment and generation tasks. No equations, fitted parameters renamed as predictions, or self-citation chains are invoked to force the central claims; the results are presented as outcomes of external validation on GVG-X-Omni and GVG-Janus rather than reducing tautologically to the inputs by construction. The approach remains self-contained against the described benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that EEG encodes visualizable perceptual details and that generative models can produce useful proxies; no free parameters or invented entities beyond the proposed framework are evident from the abstract.

axioms (1)
  • domain assumption: EEG signals contain fine-grained perceptual information that can be translated into instance-specific visual images via generative models.
    Invoked to justify using visual proxies instead of text-only alignment for non-visual EEG.
invented entities (1)
  • Generative Visual Grounding (GVG) framework (no independent evidence)
    Purpose: to generate visual proxy images from EEG for enhanced MLLM interpretation.
    Newly introduced method in the paper.

pith-pipeline@v0.9.0 · 5772 in / 1157 out tokens · 54770 ms · 2026-05-20T10:20:15.322648+00:00 · methodology
