
arxiv: 2506.09110 · v4 · submitted 2025-06-10 · 💻 cs.LG

CodeBrain: Bridging Decoupled Tokenizer and Multi-Scale Architecture for EEG Foundation Model

Pith reviewed 2026-05-19 10:10 UTC · model grok-4.3

classification 💻 cs.LG
keywords EEG foundation model · discrete tokenizer · multi-scale architecture · temporal frequency decoupling · brain signal generalization · small-world topology · representation interpretability · pretrained EEG

The pith

Decoupling EEG signals into temporal and frequency tokens plus multi-scale attention lets a foundation model generalize across brain tasks and datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CodeBrain as a two-stage EEG foundation model that first converts raw brain signals into discrete tokens by separately handling their time-based and frequency-based parts. This tokenization step is meant to create richer, more distinguishable representations while also making the internal features easier to connect to known brain phenomena. The second stage then processes those tokens with an architecture that mixes broad global patterns and fine local details to match how brain networks are organized. When pretrained on a very large collection of EEG recordings, the resulting model is shown to handle eight different analysis tasks across ten separate datasets even when the data distributions change. A reader might care because EEG is used for real-time monitoring of brain activity, and better foundation models could reduce the need to build new systems for each new use case.

Core claim

CodeBrain is a two-stage EFM. In the first stage, the TFDual-Tokenizer decouples heterogeneous temporal and frequency EEG signals into discrete tokens, quadratically expanding the representation space to enhance discriminative power and offering domain-specific representation-level interpretability by suggesting potential links to neural events and spectral rhythms. In the second stage, the multi-scale EEGSSM architecture combines structured global convolution with sliding window attention to efficiently capture both sparse long-range and local dependencies, reflecting the brain's small-world topology. Pretrained on the largest public EEG corpus, CodeBrain achieves strong generalization on a

What carries the argument

The TFDual-Tokenizer that separates temporal and frequency EEG components into discrete tokens, paired with the multi-scale EEGSSM that mixes global convolution and sliding-window attention to capture both long-range and local brain dependencies.
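As a rough illustration of the decoupled tokenization idea, the sketch below quantizes one patch against two independent codebooks and returns a (temporal, frequency) token pair. The tiny codebooks, the feature vectors, and the function names `nearest_code` and `dual_tokenize` are assumptions made for illustration. The paper's tokenizer learns its encoders and codebooks, and this is not the authors' implementation.

```python
# Illustrative sketch only: toy codebooks and features, not the paper's learned tokenizer.

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2 distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def dual_tokenize(temporal_feat, frequency_feat, temporal_codebook, frequency_codebook):
    """Quantize temporal and frequency features independently into a token pair."""
    t_id = nearest_code(temporal_feat, temporal_codebook)
    f_id = nearest_code(frequency_feat, frequency_codebook)
    return t_id, f_id


# Hypothetical 2-D codebooks with 3 entries each.
temporal_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frequency_codebook = [[0.5, 0.5], [2.0, 2.0], [-1.0, -1.0]]

token_pair = dual_tokenize([0.9, 0.1], [1.8, 2.1], temporal_codebook, frequency_codebook)

# Number of distinct (temporal, frequency) pairs available to one patch.
num_joint_codes = len(temporal_codebook) * len(frequency_codebook)
print(token_pair, num_joint_codes)
```

With two codebooks of three entries each, a patch can take any of nine token pairs, which is the product-space effect the page refers to as quadratic expansion.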

If this is right

  • The model generalizes across eight downstream tasks on ten datasets even when data distributions shift.
  • Ablation studies, scaling-law analysis, and interpretability checks support the design choices.
  • The architecture mirrors the brain's small-world topology by handling both sparse long-range and local patterns.
  • Representation-level interpretability arises from linking tokens to neural events and spectral rhythms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The token-based approach could make it easier to combine EEG data with recordings from other sensors without retraining from scratch.
  • If the scaling laws hold, larger versions of the model may continue to improve performance on rare or noisy brain signals.
  • Clinicians might one day inspect which tokens activate for specific symptoms, turning the model into a diagnostic aid rather than a black box.

Load-bearing premise

That separating temporal and frequency parts of EEG signals into tokens will both enlarge the space of possible representations and create meaningful links to actual brain events for interpretability.

What would settle it

Finding that CodeBrain shows no measurable gain in accuracy or robustness over prior EEG models when evaluated on the same eight tasks and ten datasets under distribution shifts.

Figures

Figures reproduced from arXiv: 2506.09110 by Chenyu Liu, Feng Wu, Jingying Ma, Mengling Feng, Qika Lin, Yucheng Xing, Ziyu Jia.

Figure 1. Overview of our contributions. (a) Decoupled vector quantization independently tokenizes temporal and frequency components to preserve heterogeneous EEG structures and enhance representation capacity. (b) State Space Model efficiently captures sparse global dependencies across patches. (c) Sliding Window Attention models fine-grained local dependencies within patches.
Figure 2. Overview of the CodeBrain framework. Left: TFDual-Tokenizer learns to discretize EEG signals into temporal and frequency tokens using two separate codebooks, by reconstructing both the temporal waveforms and the frequency-domain magnitude and phase. Right: EEGSSM learns representations by predicting the discrete tokens of masked patches generated by TFDual-Tokenizer.
Figure 3. Model and training data scaling laws of CodeBrain across three datasets on Cohen’s Kappa.
Figure 4. Decoupled time-frequency codes visualization on ISRUC_S3 dataset.
Figure 5. The model demonstrates a rapid initial decrease in loss during the first few epochs, followed…
Figure 5. Pretraining Loss Curve of TFDual-Tokenizer.
Figure 6. Pretraining Loss Curve of TFDual-Tokenizer. Unused Codes Analysis. During pretraining, we track the number of unused codes in both the temporal and frequency codebooks of the TFDual-Tokenizer, each with a size of 4096.
Figure 7. Unused code dynamics of the TFDual-Tokenizer.
Figure 8. Pretraining Loss Curve of EEGSSM.
Figure 9. Class-Specific Code Ratio Across Different Codebooks.
Figure 10. Effect of Contrastive Loss on Temporal Codebook Learning.
Figure 11. Computational Overhead of Using Different Backbones in the…
Figure 12. Performance Across Different Mask Ratios on FACED, SEED-V, and ISRUC_S3.
Figure 13. EEGSSM Pre-Training Loss Curve for Different Mask Ratios.
Figure 15. As expected, larger pretraining data consistently lead to lower training loss, indicating more…
Figure 14. Training Data Scaling Laws on FACED, SEED-V, and ISRUC_S3.
Figure 15. EEGSSM Pre-Training Loss Curve for Different Training Data Volume.
Figure 16. Model Size Scaling Laws on FACED, SEED-V, and ISRUC_S3.
Figure 17. EEGSSM Pre-Training Loss Curve for Different Model Size.
Original abstract

Electroencephalography (EEG) provides real-time insights into brain activity and supports diverse applications in neuroscience. While EEG foundation models (EFMs) have emerged to address the scalability issues of task-specific models, current approaches still yield clinically uninterpretable and weakly discriminative representations, inefficiently capturing global dependencies and neglecting important local neural events. We present CodeBrain, a two-stage EFM designed to fill this gap. In the first stage, we introduce the TFDual-Tokenizer, which decouples heterogeneous temporal and frequency EEG signals into discrete tokens, quadratically expanding the representation space to enhance discriminative power and offering domain-specific representation-level interpretability by suggesting potential links to neural events and spectral rhythms. In the second stage, we propose the multi-scale EEGSSM architecture, which combines structured global convolution with sliding window attention to efficiently capture both sparse long-range and local dependencies, reflecting the brain's small-world topology. Pretrained on the largest public EEG corpus, CodeBrain achieves strong generalization across eight downstream tasks and ten datasets under distribution shifts, supported by comprehensive ablations, scaling-law analyzes, and interpretability evaluations. The code and the pretrained weights are available at https://github.com/jingyingma01/CodeBrain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. CodeBrain is a two-stage EEG foundation model. Stage one introduces the TFDual-Tokenizer that decouples temporal and frequency EEG signals into discrete tokens, claimed to quadratically expand the representation space and supply domain-specific interpretability via links to neural events and spectral rhythms. Stage two deploys the multi-scale EEGSSM architecture that combines structured global convolution with sliding-window attention to capture sparse long-range and local dependencies consistent with brain small-world topology. The model is pretrained on the largest public EEG corpus and evaluated for generalization across eight downstream tasks on ten datasets under distribution shifts, accompanied by ablations, scaling-law analyses, and interpretability studies. Code and pretrained weights are released.
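One way to picture the two paths described in this summary is the toy sketch below: a sliding-window mask restricts attention to nearby positions, while a convolution whose kernel spans the full sequence provides the long-range path. Everything here (scalar features, uniform attention scores, the decaying kernel, the additive mixing, and all function names) is an assumption made for illustration and is not taken from the paper's EEGSSM code.

```python
import math

# Illustrative sketch only: scalar features and hand-picked kernel, not the EEGSSM architecture.

def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when position j lies within `window` steps of position i."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


def local_attention(values, scores, mask):
    """Softmax attention over scalar values, restricted to the masked neighbourhood."""
    out = []
    for i, row in enumerate(mask):
        allowed = [j for j, keep in enumerate(row) if keep]
        weights = [math.exp(scores[i][j]) for j in allowed]
        total = sum(weights)
        out.append(sum(w * values[j] for w, j in zip(weights, allowed)) / total)
    return out


def global_convolution(values, kernel):
    """Causal convolution whose kernel covers the whole sequence (long-range path)."""
    return [sum(kernel[k] * values[t - k] for k in range(t + 1)) for t in range(len(values))]


values = [1.0, 2.0, 3.0, 4.0, 5.0]
mask = sliding_window_mask(len(values), window=1)
uniform_scores = [[0.0] * len(values) for _ in values]   # hypothetical attention logits
local_out = local_attention(values, uniform_scores, mask)

kernel = [0.5 ** k for k in range(len(values))]          # hypothetical decaying kernel
global_out = global_convolution(values, kernel)

# Assumed mixing rule: simply add the local and global paths.
mixed = [l + g for l, g in zip(local_out, global_out)]
print(mixed)
```

The mask keeps the attention cost proportional to the window size rather than the sequence length, and the convolution lets every position see the whole preceding sequence; how the paper actually combines the two is not specified on this page.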

Significance. If the performance and attribution claims hold, the work would advance EEG foundation models by improving both discriminative power and neurophysiological interpretability while respecting brain topology. The public release of code and weights, together with scaling-law analyses and comprehensive ablations, constitutes a clear strength that supports reproducibility and further research.

major comments (3)
  1. [§3.1] §3.1 (TFDual-Tokenizer description): The assertion that decoupling into two discrete codebooks 'quadratically expands the representation space' is not accompanied by a derivation or explicit comparison. It remains unclear whether the model uses the Cartesian product of the two codebooks (size |C_t| × |C_f|) or a simple concatenation; without this formalization or an ablation against a single unified tokenizer, the claimed enhancement in discriminative power cannot be rigorously attributed to the decoupling step.
  2. [§4.3] §4.3 and interpretability subsection: The claim of 'domain-specific representation-level interpretability' via potential links to neural events and spectral rhythms is presented without quantitative validation, such as alignment metrics, statistical tests, or controls that compare learned token assignments against established neurophysiological markers. This leaves the interpretability benefit motivational rather than demonstrated, weakening the link to downstream gains.
  3. [Table 2] Table 2 (main results under distribution shifts): The reported generalization across ten datasets lacks error bars, confidence intervals, or statistical significance tests relative to baselines. Given that the central claim rests on 'strong generalization,' the absence of these elements makes it difficult to assess whether observed improvements are robust or attributable to the proposed architecture.
minor comments (2)
  1. [Abstract] Abstract: 'scaling-law analyzes' should read 'scaling-law analyses'.
  2. [Figure 2] Figure 2 (architecture diagram): The caption and legend could more explicitly distinguish the flow from TFDual-Tokenizer outputs to the EEGSSM blocks to aid reader comprehension.
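
The reporting requested in major comment 3 could look roughly like the following sketch, which computes a mean with sample standard deviation across runs and a paired t statistic against a baseline. The scores are made-up placeholder numbers, not results from the paper, and the function names are hypothetical. The critical value 2.776 is the two-sided 5% threshold of the t distribution with 4 degrees of freedom.

```python
import math
import statistics

# Illustrative sketch only: the scores below are invented placeholders, not paper results.

def mean_and_std(scores):
    """Mean and sample standard deviation across repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(model_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for run-matched scores."""
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)
    t_stat = mean_diff / (std_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Hypothetical Cohen's Kappa values over five seeds.
model_scores = [0.70, 0.72, 0.71, 0.73, 0.69]
baseline_scores = [0.66, 0.69, 0.68, 0.68, 0.67]

model_mean, model_std = mean_and_std(model_scores)
t_stat, dof = paired_t_statistic(model_scores, baseline_scores)

T_CRITICAL_DF4 = 2.776  # two-sided 5% critical value for 4 degrees of freedom
significant = abs(t_stat) > T_CRITICAL_DF4
print(f"{model_mean:.3f} +/- {model_std:.3f}, t={t_stat:.2f}, df={dof}, significant={significant}")
```

A paired test is used here on the assumption that model and baseline are run with matched seeds or splits; unmatched runs would call for an unpaired test instead.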

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful comments on our manuscript. We have carefully considered each point and outline our responses below, along with the revisions we plan to implement.

Point-by-point responses
  1. Referee: §3.1 (TFDual-Tokenizer description): The assertion that decoupling into two discrete codebooks 'quadratically expands the representation space' is not accompanied by a derivation or explicit comparison. It remains unclear whether the model uses the Cartesian product of the two codebooks (size |C_t| × |C_f|) or a simple concatenation; without this formalization or an ablation against a single unified tokenizer, the claimed enhancement in discriminative power cannot be rigorously attributed to the decoupling step.

    Authors: We appreciate this observation. Upon review, the TFDual-Tokenizer indeed employs separate discrete codebooks for temporal and frequency signals, with the effective representation space being the Cartesian product |C_t| × |C_f|. In the revised version, we will include a mathematical derivation of this quadratic expansion relative to a unified tokenizer, clarify the usage of the product, and add an ablation study comparing it to a single codebook approach to rigorously demonstrate the improvement in discriminative power. revision: yes

  2. Referee: §4.3 and interpretability subsection: The claim of 'domain-specific representation-level interpretability' via potential links to neural events and spectral rhythms is presented without quantitative validation, such as alignment metrics, statistical tests, or controls that compare learned token assignments against established neurophysiological markers. This leaves the interpretability benefit motivational rather than demonstrated, weakening the link to downstream gains.

    Authors: We agree that stronger quantitative evidence would enhance this section. We will revise the interpretability subsection to include quantitative validation, such as alignment metrics between token assignments and known neural events (e.g., P300, mu rhythm) and statistical tests against random baselines or controls, to better demonstrate the domain-specific interpretability and its connection to performance gains. revision: yes

  3. Referee: Table 2 (main results under distribution shifts): The reported generalization across ten datasets lacks error bars, confidence intervals, or statistical significance tests relative to baselines. Given that the central claim rests on 'strong generalization,' the absence of these elements makes it difficult to assess whether observed improvements are robust or attributable to the proposed architecture.

    Authors: We thank the referee for highlighting this. In the updated manuscript, we will augment Table 2 with error bars (standard deviations across runs or datasets), confidence intervals, and statistical significance tests (such as t-tests with p-values) against the baseline methods to provide a more robust assessment of the generalization performance under distribution shifts. revision: yes
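
The validation promised in response 2 could take the form of a permutation test on the association between token assignments and annotated events. The sketch below is one assumed way to do it: it measures mutual information between token IDs and event labels, then compares it against label shuffles. The token IDs, the event labels, and the function names are hypothetical and do not come from the paper.

```python
import math
import random
from collections import Counter

# Illustrative sketch only: toy token IDs and event labels, not data from the paper.

def mutual_information(tokens, labels):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(tokens)
    joint = Counter(zip(tokens, labels))
    token_counts = Counter(tokens)
    label_counts = Counter(labels)
    mi = 0.0
    for (tok, lab), count in joint.items():
        mi += (count / n) * math.log((count * n) / (token_counts[tok] * label_counts[lab]))
    return mi


def permutation_p_value(tokens, labels, num_permutations=1000, seed=0):
    """Share of label shuffles whose mutual information reaches the observed value."""
    rng = random.Random(seed)
    observed = mutual_information(tokens, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)
        if mutual_information(tokens, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (num_permutations + 1)


# Hypothetical case where each token fires for exactly one event type.
tokens = [0] * 4 + [1] * 4 + [2] * 4
labels = ["P300"] * 4 + ["mu rhythm"] * 4 + ["background"] * 4

observed_mi, p_value = permutation_p_value(tokens, labels)
print(observed_mi, p_value)
```

In this constructed example the tokens align perfectly with the labels, so the observed mutual information equals ln 3 and almost no shuffle matches it; real token assignments would be expected to show a weaker association.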

Circularity Check

1 step flagged

TFDual-Tokenizer quadratic expansion presented as derived benefit but follows directly from dual codebook definition

specific steps
  1. self-definitional [Abstract, first-stage description]
    "we introduce the TFDual-Tokenizer, which decouples heterogeneous temporal and frequency EEG signals into discrete tokens, quadratically expanding the representation space to enhance discriminative power and offering domain-specific representation-level interpretability by suggesting potential links to neural events and spectral rhythms."

    The quadratic expansion is claimed as a benefit that enhances discriminative power, yet it is the immediate result of defining the tokenizer via two separate codebooks whose combined space size is their product; this holds by construction for any dual discrete tokenizer and does not require EEG-specific properties or additional derivation.
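
The arithmetic behind this point can be written out explicitly. The symbols $\mathcal{C}_t$, $\mathcal{C}_f$, and $K$ are notation chosen here for the temporal codebook, the frequency codebook, and a common codebook size; the value 4096 is the codebook size quoted in the Figure 6 caption text above, and the assumption that tokens are combined as ordered pairs follows the simulated rebuttal rather than a confirmed detail of the paper.

```latex
\[
|\mathcal{C}_t \times \mathcal{C}_f| = |\mathcal{C}_t|\,|\mathcal{C}_f|,
\qquad
|\mathcal{C}_t| = |\mathcal{C}_f| = K
\;\Longrightarrow\;
|\mathcal{C}_t \times \mathcal{C}_f| = K^{2}.
\]
\[
K = 4096 = 2^{12}
\;\Longrightarrow\;
K^{2} = 2^{24} = 16{,}777{,}216 .
\]
```

A single codebook of the same size offers only $K$ codes per patch, so the growth from $K$ to $K^{2}$ follows from counting pairs and does not depend on any property of EEG signals.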

full rationale

The paper's central architectural claim in the first stage asserts that decoupling temporal and frequency signals into discrete tokens quadratically expands the representation space and supplies domain-specific interpretability. This expansion is a direct arithmetic consequence of combining two independent codebooks (product of their sizes) rather than an independent derivation from EEG signal properties or empirical validation. No equations are shown demonstrating quadratic growth beyond the definitional product, and the interpretability is framed as 'suggesting potential links' without quantitative mapping to neural events. The downstream generalization claims rest on this premise, but the expansion itself reduces to the tokenizer's construction. The multi-scale architecture and pretraining claims do not exhibit similar reductions and appear independent of fitted inputs or self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on the domain assumption that EEG signals contain separable temporal and frequency components whose decoupling yields interpretable tokens, plus the modeling choice that brain small-world topology is best captured by global convolution plus sliding-window attention. No explicit free parameters or invented physical entities are named in the abstract.

axioms (2)
  • domain assumption: EEG signals contain heterogeneous temporal and frequency components that can be decoupled into discrete tokens for improved representation
    Invoked in the description of the TFDual-Tokenizer stage.
  • domain assumption: The brain's small-world topology is effectively modeled by combining structured global convolution with sliding window attention
    Invoked in the description of the multi-scale EEGSSM architecture.
invented entities (2)
  • TFDual-Tokenizer (no independent evidence)
    purpose: Decouple temporal and frequency EEG signals into discrete tokens
    New component introduced in stage one to expand representation space and add interpretability.
  • EEGSSM (no independent evidence)
    purpose: Multi-scale architecture capturing sparse long-range and local dependencies
    New architecture introduced in stage two.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Foundation Model Guided Dual-Branch Co-Adaptation for Source-Free EEG Decoding

    eess.SP 2026-04 unverdicted novelty 7.0

    FUSED integrates EEG foundation models into source-free domain adaptation via dual-branch co-adaptation, consensus filtering, and two-stage pseudo-label refinement to achieve state-of-the-art cross-subject EEG decoding.

  2. PRiSE-EEG: A Prior-Guided Foundation Model with Depth-Stratified Experts for Cross-Paradigm EEG Representation Learning

    eess.SP 2026-05 unverdicted novelty 6.0

    PRiSE-EEG is a prior-guided EEG foundation model that allocates shared and specialized experts across depth using CKA-derived sigmoid mappings and reports strong cross-paradigm results on 12 benchmarks.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 2 Pith papers · 6 internal anchors

  1. [1]

    Lippincott Williams & Wilkins, 2005

    Ernst Niedermeyer and FH Lopes da Silva.Electroencephalography: basic principles, clinical applications, and related fields. Lippincott Williams & Wilkins, 2005

  2. [2]

    Eeg and meg: relevance to neuroscience.Neuron, 80(5):1112–1128, 2013

    Fernando Lopes da Silva. Eeg and meg: relevance to neuroscience.Neuron, 80(5):1112–1128, 2013

  3. [3]

    Cognitive impairment during epileptiform discharges: is it ever justifiable to treat the eeg? The Lancet Neurology, 2(12):725–730, 2003

    Colin D Binnie. Cognitive impairment during epileptiform discharges: is it ever justifiable to treat the eeg? The Lancet Neurology, 2(12):725–730, 2003

  4. [4]

    Xsleepnet: Multi-view sequential model for automatic sleep staging.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5903–5915, 2021

    Huy Phan, Oliver Y Chén, Minh C Tran, Philipp Koch, Alfred Mertins, and Maarten De Vos. Xsleepnet: Multi-view sequential model for automatic sleep staging.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5903–5915, 2021

  5. [5]

    Explainable vision transformer for automatic visual sleep staging on multimodal psg signals

    Hyojin Lee, You Rim Choi, Hyun Kyung Lee, Jaemin Jeong, Joopyo Hong, Hyun-Woo Shin, and Hyung-Sin Kim. Explainable vision transformer for automatic visual sleep staging on multimodal psg signals. npj Digital Medicine, 8(1):55, 2025

  6. [6]

    St-usleepnet: A spatial-temporal coupling prominence network for multi-channel sleep staging.arXiv preprint arXiv:2408.11884, 2024

    Jingying Ma, Qika Lin, Ziyu Jia, and Mengling Feng. St-usleepnet: A spatial-temporal coupling prominence network for multi-channel sleep staging.arXiv preprint arXiv:2408.11884, 2024

  7. [7]

    Sst-emotionnet: Spatial-spectral-temporal based attention 3d dense network for eeg emotion recognition

    Ziyu Jia, Youfang Lin, Xiyang Cai, Haobin Chen, Haijun Gou, and Jing Wang. Sst-emotionnet: Spatial-spectral-temporal based attention 3d dense network for eeg emotion recognition. In Proceedings of the 28th ACM international conference on multimedia, pages 2909–2917, 2020

  8. [8]

    Eeg emotion recognition based on dynamical graph attention network

    Yi Guo, Chao Tang, Hao Wu, and Badong Chen. Eeg emotion recognition based on dynamical graph attention network. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1921–1925. IEEE, 2024

  9. [9]

    Dmmr: Cross-subject domain generalization for eeg-based emotion recognition via denoising mixed mutual reconstruction

    Yiming Wang, Bin Zhang, and Yujiao Tang. Dmmr: Cross-subject domain generalization for eeg-based emotion recognition via denoising mixed mutual reconstruction. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 628–636, 2024

  10. [10]

    St-gf: Graph-based fusion of spatial and temporal features for eeg motor imagery decoding

    Xuhui Wang, Kui Zhao, Enze Shi, Sigang Yu, Geng Chen, and Shu Zhang. St-gf: Graph-based fusion of spatial and temporal features for eeg motor imagery decoding. In2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 3811–3816. IEEE, 2024

  11. [11]

    Emre Arı and Ertuğrul Taçgın. Nf-eeg: A generalized cnn model for multi class eeg motor imagery classification without signal preprocessing for brain computer interfaces.Biomedical Signal Processing and Control, 92:106081, 2024

  12. [12]

    Learning space-time-frequency representation with two-stream attention based 3d network for motor imagery classification

    Zhenqi Li, Jing Wang, Ziyu Jia, and Youfang Lin. Learning space-time-frequency representation with two-stream attention based 3d network for motor imagery classification. In2020 IEEE International Conference on Data Mining (ICDM), pages 1124–1129. IEEE, 2020

  13. [13]

    Exploring the diagnostic potential of llms in schizophrenia detection through eeg analysis

    Michele Guerra, Roberto Milanese, Michele Deodato, Madalina G Ciobanu, and Fausto Fasano. Exploring the diagnostic potential of llms in schizophrenia detection through eeg analysis. In2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 6812–6819. IEEE, 2024

  14. [14]

    Exploring large-scale language models to evaluate eeg-based multimodal data for mental health

    Yongquan Hu, Shuning Zhang, Ting Dang, Hong Jia, Flora D Salim, Wen Hu, and Aaron J Quigley. Exploring large-scale language models to evaluate eeg-based multimodal data for mental health. In Companion of the 2024 on ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 412–417, 2024

  15. [15]

    Brain foundation models: A survey on advancements in neural signal processing and brain discovery

    Xinliang Zhou, Chenyu Liu, Zhisheng Chen, Kun Wang, Yi Ding, Ziyu Jia, and Qingsong Wen. Brain foundation models: A survey on advancements in neural signal processing and brain discovery. arXiv preprint arXiv:2503.00580, 2025

  16. [16]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. InAdvances in neural information processing systems, volume 30, 2017. 11

  17. [17]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  18. [18]

    Eegpt: Pretrained transformer for universal and reliable representation of eeg signals

    Guangyu Wang, Wenchao Liu, Yuhong He, Cong Xu, Lin Ma, and Haifeng Li. Eegpt: Pretrained transformer for universal and reliable representation of eeg signals. In Advances in Neural Information Processing Systems, volume 37, pages 39249–39280, 2024

  19. [19]

    Cbramod: A criss-cross brain foundation model for eeg decoding

    Jiquan Wang, Sha Zhao, Zhiling Luo, Yangxuan Zhou, Haiteng Jiang, Shijian Li, Tao Li, and Gang Pan. Cbramod: A criss-cross brain foundation model for eeg decoding. InThe Third International Conference on Learning Representations, 2025

  20. [20]

    Large brain model for learning generic represen- tations with tremendous eeg data in bci

    Weibang Jiang, Liming Zhao, and Bao-liang Lu. Large brain model for learning generic represen- tations with tremendous eeg data in bci. InThe Twelfth International Conference on Learning Representations, 2024

  21. [21]

    Tokenizing Single-Channel EEG with Time-Frequency Motif Learning

    Jathurshan Pradeepkumar, Xihao Piao, Zheng Chen, and Jimeng Sun. Single-channel eeg tok- enization through time-frequency modeling.arXiv preprint arXiv:2502.16060, 2025

  22. [22]

    Online clustered codebook

    Chuanxia Zheng and Andrea Vedaldi. Online clustered codebook. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22798–22807, 2023

  23. [23]

    Finite scalar quantiza- tion: Vq-vae made simple

    Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantiza- tion: Vq-vae made simple. InThe Twelfth International Conference on Learning Representations, 2024

  24. [24]

    Decomposing eeg data into space–time–frequency components using parallel factor analysis.NeuroImage, 22(3):1035–1045, 2004

    Fumikazu Miwakeichi, Eduardo Martınez-Montes, Pedro A Valdés-Sosa, Nobuaki Nishiyama, Hiroaki Mizuhara, and Yoko Yamaguchi. Decomposing eeg data into space–time–frequency components using parallel factor analysis.NeuroImage, 22(3):1035–1045, 2004

  25. [25]

    Vector quantization for recommender systems: a review and outlook

    Qijiong Liu, Xiaoyu Dong, Jiaren Xiao, Nuo Chen, Hengchang Hu, Jieming Zhu, Chenxu Zhu, Tetsuya Sakai, and Xiao-Ming Wu. Vector quantization for recommender systems: a review and outlook. arXiv preprint arXiv:2405.03110, 2024

  26. [26]

    Complex brain networks: graph theoretical analysis of structural and functional systems.Nature reviews neuroscience, 10(3):186–198, 2009

    Ed Bullmore and Olaf Sporns. Complex brain networks: graph theoretical analysis of structural and functional systems.Nature reviews neuroscience, 10(3):186–198, 2009

  27. [27]

    Small-world brain networks

    Danielle Smith Bassett and ED Bullmore. Small-world brain networks. The neuroscientist, 12(6):512–523, 2006

  28. [28]

    Uncovering intrinsic modular organization of spontaneous brain activity in humans.PloS one, 4(4):e5226, 2009

    Yong He, Jinhui Wang, Liang Wang, Zhang J Chen, Chaogan Yan, Hong Yang, Hehan Tang, Chaozhe Zhu, Qiyong Gong, Yufeng Zang, et al. Uncovering intrinsic modular organization of spontaneous brain activity in humans.PloS one, 4(4):e5226, 2009

  29. [29]

    Biot: Biosignal transformer for cross-data learning in the wild

    Chaoqi Yang, M Westover, and Jimeng Sun. Biot: Biosignal transformer for cross-data learning in the wild. InAdvances in Neural Information Processing Systems, volume 36, pages 78240–78260, 2023

  30. [30]

    Eeg2rep: enhancing self-supervised eeg representation through informative masked inputs

    Navid Mohammadi Foumani, Geoffrey Mackellar, Soheila Ghane, Saad Irtza, Nam Nguyen, and Mahsa Salehi. Eeg2rep: enhancing self-supervised eeg representation through informative masked inputs. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5544–5555, 2024

  31. [31]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information processing systems, volume 30, 2017

  32. [32]

    Long range arena : A benchmark for efficient transformers

    Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena : A benchmark for efficient transformers. In The Ninth International Conference on Learning Representations, 2021. 12

  33. [33]

    An Efficient Self-Supervised Framework for Long-Sequence EEG Modeling

    Jiazhen Hong, Geoffrey Mackellar, and Soheila Ghane. Eegm2: An efficient mamba-2-based self-supervised framework for long-sequence eeg modeling.arXiv preprint arXiv:2502.17873, 2025

  34. [34]

    Femba: Effi- cient and scalable eeg analysis with a bidirectional mamba foundation model.arXiv preprint arXiv:2502.06438, 2025

    Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, and Yawei Li. Femba: Effi- cient and scalable eeg analysis with a bidirectional mamba foundation model.arXiv preprint arXiv:2502.06438, 2025

  35. [35]

    Springer Publishing Company, 2021

    William O Tatum IV.Handbook of EEG interpretation. Springer Publishing Company, 2021

  36. [36]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021

  37. [37]

    Simplified state space layers for sequence modeling

    Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. InICLR, 2023

  38. [38]

    What makes convolutional models great on long sequence modeling?

    Yuhong Li, Tianle Cai, Yi Zhang, Deming Chen, and Debadeepta Dey. What makes convolutional models great on long sequence modeling? arXiv preprint arXiv:2210.09298, 2022

  39. [39]

    On the parameterization and initialization of diagonal state space models

    Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022

  40. [40]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  41. [41]

    Internimage: Exploring large-scale vision foundation models with deformable convolutions

    Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14408–14419, 2023

  42. [42]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  43. [43]

    Graphsleepnet: Adaptive spatial-temporal graph convolutional networks for sleep stage classification

    Ziyu Jia, Youfang Lin, Jing Wang, Ronghao Zhou, Xiaojun Ning, Yuanlai He, and Yaoshuai Zhao. Graphsleepnet: Adaptive spatial-temporal graph convolutional networks for sleep stage classification. In IJCAI, volume 2021, pages 1324–1330, 2020

  44. [44]

    Caresleepnet: a hybrid deep learning network for automatic sleep staging

    Jiquan Wang, Sha Zhao, Haiteng Jiang, Yangxuan Zhou, Zhenghe Yu, Tao Li, Shijian Li, and Gang Pan. Caresleepnet: a hybrid deep learning network for automatic sleep staging. IEEE Journal of Biomedical and Health Informatics, 2024

  45. [45]

    Long-term eeg partitioning for seizure onset detection

    Zheng Chen, Yasuko Matsubara, Yasushi Sakurai, and Jimeng Sun. Long-term eeg partitioning for seizure onset detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 14221–14229, 2025

  46. [46]

    Large cognition model: Towards pretrained eeg foundation model

    Chi-Sheng Chen, Ying-Jung Chen, and Aidan Hung-Wen Tsai. Large cognition model: Towards pretrained eeg foundation model. arXiv preprint arXiv:2502.17464, 2025

  47. [47]

    Bendr: Using transformers and a contrastive self-supervised learning task to learn from massive amounts of eeg data

    Demetres Kostas, Stephane Aroca-Ouellette, and Frank Rudzicz. Bendr: Using transformers and a contrastive self-supervised learning task to learn from massive amounts of eeg data. Frontiers in Human Neuroscience, 15:653659, 2021

  48. [48]

    Brant: Foundation model for intracranial neural signal

    Daoze Zhang, Zhizhang Yuan, Yang Yang, Junru Chen, Jingjing Wang, and Yafeng Li. Brant: Foundation model for intracranial neural signal. Advances in Neural Information Processing Systems, 36:26304–26321, 2023

  49. [49]

    Brant-2: Foundation model for brain signals

    Zhizhang Yuan, Daoze Zhang, Junru Chen, Gefei Gu, and Yang Yang. Brant-2: Foundation model for brain signals. CoRR, 2024

  50. [50]

    Brant-x: A unified physiological signal alignment framework

    Daoze Zhang, Zhizhang Yuan, Junru Chen, Kerui Chen, and Yang Yang. Brant-x: A unified physiological signal alignment framework. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4155–4166, 2024

  51. [51]

    Neurolm: A universal multi-task foundation model for bridging the gap between language and eeg signals

    Weibang Jiang, Yansen Wang, Bao-liang Lu, and Dongsheng Li. Neurolm: A universal multi-task foundation model for bridging the gap between language and eeg signals. In The Thirteenth International Conference on Learning Representations, 2025

  52. [52]

    Neural machine translation of rare words with subword units

    Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In 54th Annual Meeting of the Association for Computational Linguistics, pages 1715–1725. Association for Computational Linguistics (ACL), 2016

  53. [53]

    Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, 2018

  54. [54]

    Deep state space models for time series forecasting

    Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. Advances in neural information processing systems, 31, 2018

  55. [55]

    Hippo: Recurrent memory with optimal polynomial projections

    Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. Advances in neural information processing systems, 33:1474–1487, 2020

  56. [56]

    Simplified State Space Layers for Sequence Modeling

    Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022

  57. [57]

    Aniruddh Raghu, Payal Chandak, Ridwan Alam, John Guttag, and Collin M. Stultz. Sequential multi-dimensional self-supervised learning for clinical time series, 2023

  58. [58]

    Hungry hungry hippos: Towards language modeling with state space models

    Tri Dao, Daniel Y Fu, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052, 2022

  59. [59]

    Deep latent state space models for time-series generation

    Linqi Zhou, Michael Poli, Winnie Xu, Stefano Massaroli, and Stefano Ermon. Deep latent state space models for time-series generation. In International Conference on Machine Learning, pages 42625–42643. PMLR, 2023

  60. [60]

    Eeg-ssm: Leveraging state-space model for dementia detection

    Xuan-The Tran, Linh Le, Quoc Toan Nguyen, Thomas Do, and Chin-Teng Lin. Eeg-ssm: Leveraging state-space model for dementia detection. arXiv preprint arXiv:2407.17801, 2024

  61. [61]

    Eegmamba: Bidirectional state space model with mixture of experts for eeg multi-task classification

    Yiyu Gui, MingZhi Chen, Yuqi Su, Guibo Luo, and Yuchao Yang. Eegmamba: Bidirectional state space model with mixture of experts for eeg multi-task classification, 2024

  62. [62]

    An algorithm for the machine calculation of complex fourier series

    James W Cooley and John W Tukey. An algorithm for the machine calculation of complex fourier series. Mathematics of computation, 19(90):297–301, 1965

  63. [63]

    Clocs: Contrastive learning of cardiac signals across space, time, and patients

    Dani Kiyasseh, Tingting Zhu, and David A Clifton. Clocs: Contrastive learning of cardiac signals across space, time, and patients. In International Conference on Machine Learning, pages 5606–5615. PMLR, 2021

  64. [64]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020

  65. [65]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization, 2019

  66. [66]

    The temple university hospital eeg data corpus

    Iyad Obeid and Joseph Picone. The temple university hospital eeg data corpus. Frontiers in neuroscience, 10:196, 2016

  67. [67]

    The ten-twenty electrode system of the international federation

    Herbert H Jasper. The ten-twenty electrode system of the international federation. Electroenceph clin Neurophysiol, 10:367–380, 1958

  68. [68]

    A large finer-grained affective computing eeg dataset

    Jingjing Chen, Xiaobin Wang, Chen Huang, Xin Hu, Xinke Shen, and Dan Zhang. A large finer-grained affective computing eeg dataset. Scientific Data, 10(1):740, 2023

  69. [69]

    Wei Liu, Jie-Lin Qiu, Wei-Long Zheng, and Bao-Liang Lu. Comparing recognition performance and robustness of multimodal deep learning models for multimodal emotion recognition. IEEE Transactions on Cognitive and Developmental Systems, 14(2):715–729, 2021

  70. [70]

    Isruc-sleep: A comprehensive public dataset for sleep researchers

    Sirvan Khalighi, Teresa Sousa, José Moutinho Santos, and Urbano Nunes. Isruc-sleep: A comprehensive public dataset for sleep researchers. Computer methods and programs in biomedicine, 124:180–192, 2016

  71. [71]

    2020 international brain–computer interface competition: A review

    Ji-Hoon Jeong, Jeong-Hyun Cho, Young-Eun Lee, Seo-Hyun Lee, Gi-Hwan Shin, Young-Seok Kweon, José del R Millán, Klaus-Robert Müller, and Seong-Whan Lee. 2020 international brain–computer interface competition: A review. Frontiers in human neuroscience, 16:898300, 2022

  72. [72]

    Application of machine learning to epileptic seizure onset detection and treatment

    Ali Hossam Shoeb. Application of machine learning to epileptic seizure onset detection and treatment. PhD thesis, Massachusetts Institute of Technology, 2009

  73. [73]

    MDD Patients and Healthy Controls EEG Data (New)

    Wajid Mumtaz. MDD Patients and Healthy Controls EEG Data (New). Figshare, November 2016

  74. [74]

    Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals

    Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000

  75. [75]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In The Tenth International Conference on Learning Representations, 2022

  76. [76]

    Simple hardware-efficient long convolutions for sequence modeling

    Daniel Y Fu, Elliot L Epstein, Eric Nguyen, Armin W Thomas, Michael Zhang, Tri Dao, Atri Rudra, and Christopher Ré. Simple hardware-efficient long convolutions for sequence modeling. In International Conference on Machine Learning, pages 10373–10391. PMLR, 2023

  77. [77]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  78. [78]

    Richard B Berry, Rita Brooks, Charlene E Gamaldo, Susan M Harding, Carole Marcus, Bradley V Vaughn, et al. The aasm manual for the scoring of sleep and associated events. Rules, Terminology and Technical Specifications, Darien, Illinois, American Academy of Sleep Medicine, 176(2012):7, 2012

  79. [79]

    Eegnet: a compact convolutional neural network for eeg-based brain–computer interfaces

    Vernon J Lawhern, Amelia J Solon, Nicholas R Waytowich, Stephen M Gordon, Chou P Hung, and Brent J Lance. Eegnet: a compact convolutional neural network for eeg-based brain–computer interfaces. Journal of neural engineering, 15(5):056013, 2018

  80. [80]

    Transformer-based spatial-temporal feature learning for eeg decoding

    Yonghao Song, Xueyu Jia, Lie Yang, and Longhan Xie. Transformer-based spatial-temporal feature learning for eeg decoding. arXiv preprint arXiv:2106.11170, 2021

A Preliminaries

Convolution State Space Models. The state-space model is a classic model in control theory, and it represents the operational state of a system using first-order differential equa...

Showing first 80 references.