
arxiv: 2511.10848 · v2 · submitted 2025-11-13 · 💻 cs.LG · cs.AI

STAMP: Spatial-Temporal Adapter with Multi-Head Pooling

Pith reviewed 2026-05-17 21:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords EEG classification · time series foundation models · adapters · spatial-temporal modeling · multi-head pooling · brain signals · clinical tasks

The pith

A lightweight adapter lets general time series foundation models match specialized EEG models on clinical classification tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STAMP, an adapter that takes univariate embeddings from any general time series foundation model and applies multi-head pooling to handle EEG recordings. It demonstrates that this setup implicitly captures the spatial and temporal structure of brain signals and reaches performance levels comparable to EEG-specific foundation models across eight benchmark clinical datasets. The approach requires only a small number of trainable parameters and works with flexible input formats. A sympathetic reader would care because it reduces the need to train separate large models for every specialized domain like EEG.

Core claim

We introduce a novel Spatial-Temporal Adapter with Multi-Head Pooling (STAMP), which leverages univariate embeddings produced by a general TSFM, implicitly models spatial-temporal characteristics of EEG data, and achieves performance comparable to state-of-the-art EEGFMs on classification tasks.

What carries the argument

Multi-head pooling applied to univariate embeddings from a general time series foundation model to implicitly capture spatial relationships across EEG channels.
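
To make the pooling step concrete, here is a minimal sketch of a generic multi-head attention pooling layer in plain Python. It is an illustration under stated assumptions, not the paper's MHAP: the function names, the random stand-in embeddings, the embedding size `DIM`, and the head count `HEADS` are all hypothetical, and a real adapter would learn the query vectors during training rather than sample them.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_attention_pool(tokens, queries):
    """Pool N token embeddings (each of size D) into one vector of size D.

    tokens:  list of N embeddings, one per (channel, time window) pair
    queries: list of H query vectors, each of size D // H (assumed learned)
    """
    num_heads = len(queries)
    dim = len(tokens[0])
    if dim % num_heads != 0:
        raise ValueError("embedding size must be divisible by the number of heads")
    head_dim = dim // num_heads
    pooled = []
    for head, query in enumerate(queries):
        start = head * head_dim
        # Each head only sees its own slice of every token embedding.
        chunks = [tok[start:start + head_dim] for tok in tokens]
        scores = [
            sum(q * x for q, x in zip(query, chunk)) / math.sqrt(head_dim)
            for chunk in chunks
        ]
        weights = softmax(scores)
        # Weighted average of the token slices for this head.
        for d in range(head_dim):
            pooled.append(sum(w * chunk[d] for w, chunk in zip(weights, chunks)))
    return pooled


rng = random.Random(0)
DIM, HEADS = 8, 2  # hypothetical sizes, chosen only to keep the example small
queries = [[rng.gauss(0, 1) for _ in range(DIM // HEADS)] for _ in range(HEADS)]


def fake_embeddings(num_channels, num_windows):
    """Random stand-ins for frozen per-channel TSFM embeddings."""
    return [
        [rng.gauss(0, 1) for _ in range(DIM)]
        for _ in range(num_channels * num_windows)
    ]


# The same pooling handles recordings with different channel counts.
pooled_19 = multi_head_attention_pool(fake_embeddings(19, 4), queries)
pooled_22 = multi_head_attention_pool(fake_embeddings(22, 4), queries)
print(len(pooled_19), len(pooled_22))
```

Because the pooled vector has a fixed size regardless of how many tokens go in, a design of this kind can accept recordings with different numbers of channels without changing the classifier that follows.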

If this is right

  • General time series foundation models can be reused for EEG tasks with only lightweight adaptation.
  • Domain-specific pretraining may not be required to reach competitive results on EEG benchmarks.
  • The adapter supports different numbers of channels and input configurations for multivariate signals.
  • Computational cost for EEG modeling decreases because most parameters come from an already-trained general model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Rich temporal embeddings may contain enough information for pooling to recover channel interactions that would otherwise require explicit spatial layers.
  • The same adapter pattern could apply to other multivariate time series where channels have latent spatial or structural meaning.
  • Testing on datasets with higher channel counts or different electrode montages would show how far the implicit modeling extends.

Load-bearing premise

That univariate embeddings from a general time series foundation model combined with multi-head pooling can sufficiently capture spatial relationships across EEG channels without explicit spatial modeling or EEG-specific pretraining.

What would settle it

On a new clinical EEG classification dataset, a STAMP-adapted general model scoring well below a dedicated EEG foundation model would challenge the claim of comparability.

Figures

Figures reproduced from arXiv: 2511.10848 by Abby Turner, Artur Dubrawski, Brad Shook, Jieshi Chen, Jonathan Elmer, Michał Wiliński, Mononito Goswami.

Figure 1: A diagram showing how EEG data is processed by MOMENT and STAMP. The EEG data is separated into tokens, which are embedded using MOMENT before positional encoding is applied. The resulting tokens are passed through the CC-GMLP, where spatial and temporal relationships are incorporated into embeddings. MHAP then determines relevant features and generates final predictions by projecting embeddings into lower…
Figure 2: Performance comparison between four positional encoding options: No PE (0.71M), PE-N (0.73M), PE-ST (0.72M), and PE-NST (0.74M). The value in parentheses indicates the average number of trainable parameters across the 4 datasets.
Figure 4: Performance comparison between token aggregation strategies: mean pooling (0.70M) and MHAP (0.74M). The value in parentheses indicates the average number of trainable parameters across the 4 datasets. Specifically, performance between mean pooling and MHAP on every dataset except BCIC-IV-2a is similar. However, MHAP demonstrates a significant performance boos…
Figure 3: Performance comparison between four different token mixer options: B-GMLP (0.79M), CC-GMLP (0.74M), B-TF (1.25M), and CC-TF (0.99M). The value in parentheses indicates the average number of trainable parameters across the 4 datasets. In our ablation study comparing token mixing strategies, we see that CC-GMLP performs strongly across each dataset. Across all four datasets, the GMLP architecture performs b…
Figure 5: Performance comparison between the full evaluation of 5 methods: STAMP (0.74M), CBraMod (29M), LaBraM (5.8M), ST-Transformer (3.5M), and EEG Conformer (0.55M). The value in parentheses indicates the average number of trainable parameters across the 4 datasets. Further analysis demonstrated that STAMP can often provide the same level of performance with even fewer parameters (see Appendix D). …
Figure 6: Performance comparison between using the following TSFMs with STAMP: TSPulse (1M, 0.63M), MOMENT Small (40M, 0.67M), MOMENT Base (125M, 0.7M), MOMENT Large (385M, 0.74M), and Chronos Large (710M, 0.74M). The first value in the parentheses indicates the size of the TSFM and the second value denotes the average number of trainable parameters in STAMP across the 4 datasets.
Figure 7: Performance comparison of STAMP with varying D values.
Figure 9: Performance comparison of baselines, MOMENT embeddings with STAMP, and Chronos Large embeddings with STAMP on the two emotion recognition datasets: SEED-V and FACED. To further analyze the reason that STAMP performs poorly for the two emotion recognition datasets, SEED-V and FACED, we ran an additional STAMP experiment using embeddings from Chronos Large. The previously mentioned 5 seeds were used for th…
Figure 8: Performance comparison of STAMP when using embeddings from Chronos Large with only the EOS embedding versus an embedding from mean pooling. The first 200 embeddings correspond to the length of the time series and the last embedding corresponds to an EOS token. For use with STAMP, we reduced these embeddings to a single representative embedding. We tested two aggregation methods: 1) mean pooling acr…
Figure 10: Performance comparison of STAMP using varying numbers of temporal channels. To investigate how the availability of temporal channels affects performance, we performed STAMP experiments using the first t temporal channels, where t ∈ {1, 2, 3, 4, 5}.
Figure 12: Performance comparison between MOMENT with mean pooling (0.04M) versus using MOMENT with STAMP (0.74M). The value in parentheses indicates the average number of trainable parameters across the 4 datasets. The most naive baseline involving MOMENT is to use only mean pooling on the embeddings…
Figure 11: Performance comparison of STAMP using D = 96 versus EEG Conformer. Both methods have ≈ 0.55M trainable parameters. STAMP provides superior performance compared to EEG Conformer for nearly all datasets and metrics evaluated. One may argue that this is due to increased capacity in STAMP and that EEG Conformer is a more efficient method for EEG modeling. In order to demonstrate that this is not the case, …
Figure 13: Performance comparison between three variants of STAMP: PE-NST + Mean Pooling, PE-NST + MHAP, PE-NST + CC-GMLP + MHAP. This comparison demonstrates that the adapter requires some form of relationship modeling in the form of either token mixing or MHAP.
Figure 14: Performance comparison between finetuning MOMENT using LoRA and STAMP (2.3M) versus only finetuning STAMP (0.74M). The value in parentheses indicates the average number of trainable parameters across the 4 datasets.
Figure 15: Performance comparison (across all metrics) between the full evaluation of 5 methods: STAMP, CBraMod, LaBraM, ST-Transformer, and EEG Conformer on SHU-MI, PhysioNet-MI, MentalArithmetic, and BCIC-IV-2a.
Figure 16: Performance comparison (across all metrics) between the full evaluation of 5 methods: STAMP, CBraMod, LaBraM, ST-Transformer, and EEG Conformer on TUEV, Mumtaz2016, SEED-V, and FACED.
Abstract

Time series foundation models (TSFMs) pretrained on data from multiple domains have shown strong performance on diverse modeling tasks. Various efforts have been made to develop foundation models specific to electroencephalography (EEG) data, which records brain electrical activity as time series. However, no comparative analysis of EEG-specific foundation models (EEGFMs) versus general TSFMs has been performed on EEG-specific tasks. We introduce a novel Spatial-Temporal Adapter with Multi-Head Pooling (STAMP), which leverages univariate embeddings produced by a general TSFM, implicitly models spatial-temporal characteristics of EEG data, and achieves performance comparable to state-of-the-art EEGFMs. A comprehensive analysis is performed on 8 benchmark datasets of clinical tasks using EEG for classification, along with ablation studies. Our proposed adapter is lightweight in trainable parameters and flexible in the inputs it can accommodate, supporting easy modeling of EEG data using TSFMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces STAMP, a lightweight Spatial-Temporal Adapter with Multi-Head Pooling that processes univariate embeddings produced by feeding each EEG channel independently through a general time series foundation model (TSFM). It claims this adapter implicitly captures spatial-temporal characteristics of EEG signals and achieves performance comparable to state-of-the-art EEG-specific foundation models (EEGFMs) across 8 clinical classification benchmark datasets, supported by ablation studies. The approach is presented as flexible and parameter-efficient for adapting general TSFMs to EEG tasks without domain-specific pretraining.

Significance. If the central empirical claims hold under rigorous validation, the work would be significant for demonstrating that general TSFMs can be adapted to EEG without explicit spatial operators or EEG-specific pretraining, potentially simplifying model development in neuroscience applications. The reported ablation studies and multi-benchmark evaluation provide a foundation for assessing flexibility and efficiency, though stronger evidence for the implicit spatial modeling would strengthen the contribution.

major comments (2)
  1. [Abstract] Abstract and method description: The central claim that STAMP 'implicitly models spatial-temporal characteristics of EEG data' relies on multi-head pooling over per-channel univariate TSFM embeddings recovering spatial channel correlations. However, the TSFM is pretrained on scalar time series (no topographic bias) and standard multi-head pooling is permutation-invariant, so it does not inherently encode electrode geometry, adjacency, or relative positions. Ablation studies isolating this component (e.g., comparing against random channel permutations or explicit spatial baselines) are needed to substantiate implicit modeling rather than statistical co-occurrence.
  2. [Experimental section] Experimental evaluation: The comparability to SOTA EEGFMs is reported on 8 benchmarks with ablations, but without visible full details on error bars, exact baseline re-implementations, or statistical significance tests, the support for performance parity cannot be fully assessed. This affects the strength of the claim that the adapter matches specialized EEGFMs.
minor comments (2)
  1. [Abstract] Clarify the specific general TSFM backbone used and report the exact number of trainable parameters in STAMP for reproducibility.
  2. [Results] Ensure figures or tables comparing STAMP to baselines include standard deviations or confidence intervals to aid interpretation of results.
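
The permutation-invariance point in major comment 1 can be checked with a small experiment. The sketch below uses a generic single-head attention pooling and a hypothetical index-based encoding (`add_index_encoding`); neither is taken from the paper. It only illustrates that pooling without any encoding ignores channel order, while an encoding tied to token index makes the output order-dependent.

```python
import math


def attention_pool(tokens, query):
    """Single-head attention pooling: softmax-weighted average of tokens."""
    scores = [sum(q * x for q, x in zip(query, tok)) for tok in tokens]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(query)
    return [
        sum(e / total * tok[d] for e, tok in zip(exps, tokens))
        for d in range(dim)
    ]


def add_index_encoding(tokens):
    """Hypothetical positional encoding: shift each token by 0.5 * its index."""
    return [[x + 0.5 * i for x in tok] for i, tok in enumerate(tokens)]


# Three toy "channel" tokens and a fixed query (illustrative values only).
tokens = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
query = [1.0, 0.0]
reordered = tokens[::-1]  # a deterministic channel permutation

plain_a = attention_pool(tokens, query)
plain_b = attention_pool(reordered, query)
encoded_a = attention_pool(add_index_encoding(tokens), query)
encoded_b = attention_pool(add_index_encoding(reordered), query)

# Without an encoding, channel order does not change the pooled vector.
invariant = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(plain_a, plain_b)
)
# With an index-based encoding, the same permutation changes the output.
order_sensitive = any(
    not math.isclose(a, b, abs_tol=1e-6) for a, b in zip(encoded_a, encoded_b)
)
print(invariant, order_sensitive)
```

A random-channel-permutation ablation of the kind the referee requests would apply the same idea to the trained adapter: permute the channel order at evaluation time and compare scores with and without the positional encoding.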

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below and have revised the manuscript accordingly to clarify our claims and strengthen the experimental reporting.

Point-by-point responses
  1. Referee: [Abstract] Abstract and method description: The central claim that STAMP 'implicitly models spatial-temporal characteristics of EEG data' relies on multi-head pooling over per-channel univariate TSFM embeddings recovering spatial channel correlations. However, the TSFM is pretrained on scalar time series (no topographic bias) and standard multi-head pooling is permutation-invariant, so it does not inherently encode electrode geometry, adjacency, or relative positions. Ablation studies isolating this component (e.g., comparing against random channel permutations or explicit spatial baselines) are needed to substantiate implicit modeling rather than statistical co-occurrence.

    Authors: We appreciate this observation regarding the mechanism of implicit modeling. The STAMP adapter aggregates per-channel TSFM embeddings via multi-head pooling, enabling the model to learn inter-channel statistical dependencies directly from the EEG data without explicit spatial operators or topographic pretraining. This data-driven aggregation captures the spatial correlations necessary for clinical tasks, as demonstrated by competitive performance across benchmarks. To address the request for isolating ablations, we will add a random channel permutation experiment in the revised supplementary material and include a brief comparison to an explicit spatial baseline (e.g., a lightweight graph-based aggregator on electrode positions) to highlight the parameter efficiency of our approach. We have also revised the abstract and method description to more precisely characterize the implicit nature of the spatial-temporal modeling.

    revision: partial

  2. Referee: [Experimental section] Experimental evaluation: The comparability to SOTA EEGFMs is reported on 8 benchmarks with ablations, but without visible full details on error bars, exact baseline re-implementations, or statistical significance tests, the support for performance parity cannot be fully assessed. This affects the strength of the claim that the adapter matches specialized EEGFMs.

    Authors: We thank the referee for this feedback on experimental rigor. In the revised manuscript, we will expand the experimental section and appendix to report: standard deviation error bars computed over five independent runs with different random seeds for all methods and datasets; detailed re-implementation protocols for each EEGFM baseline, including exact hyperparameters, data preprocessing, and training settings; and statistical significance results using paired t-tests (with p-values) between STAMP and the EEGFMs on each of the eight benchmarks. These additions will provide a more complete assessment of performance parity.

    revision: yes
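
The reporting promised in this response (mean and standard deviation over five seeds, plus a paired t-test) can be sketched with the Python standard library. The scores below are made-up placeholders, not results from the paper, and pairing runs by seed is an assumption about how the comparison would be set up. The sketch stops at the t statistic; a p-value needs the t-distribution CDF, which `scipy.stats.ttest_rel` provides if SciPy is available.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for seed-matched runs."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)
    return mean_diff / (std_diff / math.sqrt(n)), n - 1


# Placeholder per-seed scores (hypothetical, for illustration only).
adapter_scores = [0.70, 0.72, 0.71, 0.69, 0.73]
baseline_scores = [0.71, 0.71, 0.72, 0.70, 0.72]

adapter_mean, adapter_std = summarize(adapter_scores)
baseline_mean, baseline_std = summarize(baseline_scores)
t_stat, df = paired_t_statistic(adapter_scores, baseline_scores)

# Two-sided 5% critical value of the t distribution with 4 degrees of freedom.
T_CRITICAL_DF4 = 2.776
significant = abs(t_stat) > T_CRITICAL_DF4

print(f"adapter  {adapter_mean:.3f} +/- {adapter_std:.3f}")
print(f"baseline {baseline_mean:.3f} +/- {baseline_std:.3f}")
print(f"t = {t_stat:.3f}, df = {df}, significant at 5%: {significant}")
```

With only five seeds the test has little power, so a non-significant difference is weak evidence of parity; reporting the standard deviations alongside the test helps readers judge the overlap directly.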

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external benchmarks

Full rationale

The paper presents STAMP as a lightweight adapter that feeds per-channel univariate time series into a pretrained general TSFM and applies multi-head pooling to implicitly capture spatial-temporal EEG structure. All performance claims are supported by direct empirical comparisons on 8 external benchmark datasets plus ablation studies, rather than any derivation that reduces by construction to fitted parameters or self-referential definitions. No equations, uniqueness theorems, or self-citations are invoked as load-bearing premises that would collapse the central result to its own inputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the transferability of general TSFM embeddings to EEG and the sufficiency of implicit spatial-temporal modeling via pooling; no explicit free parameters or invented physical entities are described in the abstract.

axioms (1)
  • [domain assumption] Univariate embeddings from a general TSFM pretrained on other domains transfer usefully to EEG channels.
    Invoked when the adapter is applied to EEG data without additional pretraining.
invented entities (1)
  • STAMP adapter with multi-head pooling [no independent evidence]
    purpose: To implicitly model spatial-temporal EEG characteristics from univariate TSFM embeddings
    New method component introduced to bridge general and EEG-specific modeling.

pith-pipeline@v0.9.0 · 5480 in / 1299 out tokens · 40264 ms · 2026-05-17T21:41:52.504868+00:00 · methodology


