pith. sign in

arxiv: 2606.19888 · v1 · pith:D2OM6ZDFnew · submitted 2026-06-18 · 💻 cs.LG · cs.AI

SL-S4Wave: Self-Supervised Learning of Physiological Waveforms with Structured State Space Models

Pith reviewed 2026-06-26 18:33 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords self-supervised learningstructured state space modelsphysiological waveformsarrhythmia detectioncontrastive learningECGEEG
0
0 comments X

The pith

A modified S4 encoder in a contrastive framework learns representations from unlabeled ECG and EEG data that support strong arrhythmia detection with fewer labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SL-S4Wave to address challenges in modeling long, noisy, high-resolution multichannel physiological signals such as ECGs. It combines contrastive self-supervised learning with a structured state space model encoder that adds multi-layer global convolution using multiscale subkernels. This targets both local patterns and extended temporal dependencies where prior CNN-based SSL methods fall short. Experiments on real datasets show gains in arrhythmia detection accuracy, label efficiency, handling of long segments, generalization to new arrhythmia types, and performance on EEG tasks.

Core claim

SL-S4Wave shows that pairing contrastive learning with an S4 encoder modified by multi-layer global convolution using multiscale subkernels produces waveform representations that outperform supervised and self-supervised baselines on arrhythmia detection, maintain performance with limited labels and long sequences, generalize across arrhythmia types, and extend effectively to EEG tasks.

What carries the argument

The multi-layer global convolution using multiscale subkernels inside the S4 encoder, which jointly extracts fine local details and long-range temporal structure from noisy multichannel inputs.

If this is right

  • Outperforms state-of-the-art supervised and self-supervised baselines on arrhythmia detection.
  • Reaches high performance using significantly fewer labeled examples.
  • Maintains robust results on long waveform segments where most prior methods degrade.
  • Transfers effectively to arrhythmia types absent from training data.
  • Delivers superior results over strong baselines across multiple EEG tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The encoder change may reduce the volume of expert annotations needed to train clinical waveform classifiers.
  • The same multiscale convolution pattern inside S4 layers could apply to other high-rate sensor streams such as wearable device data.
  • Testing whether the encoder alone, without the contrastive pretraining step, already improves fully supervised training would isolate the contribution of the architectural change.

Load-bearing premise

That adding multi-layer global convolution with multiscale subkernels to an S4 encoder will capture both fine-grained local patterns and long-range temporal dependencies in noisy high-resolution multichannel waveforms better than existing S4 or CNN-based SSL designs.

What would settle it

A replication of the arrhythmia detection experiments on an independent dataset in which SL-S4Wave requires as many labeled examples as the strongest CNN-based SSL baseline to reach equivalent accuracy.

Figures

Figures reproduced from arXiv: 2606.19888 by Eric Lehman, Feng Wu, Gari Clifford, Guoshuai Zhao, Harsh Deep, Li-wei H Lehman, Rahul Krishnan, Sanyam Kapoor.

Figure 1
Figure 1. Figure 1: Overview of proposed Self-Supervised Learning Structured State Space Model (SL-S4Wave). Left: [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Fine-Tuning Data Efficiency. We plot the fine-tuning performance vs. the fraction of training data used for fine-tuning on the VTaC (N = 5, 037) and MIMIC II-VT (N = 2, 802) benchmarks and report average of 5 seeds. Even with only 10% of the labeled data, SL-S4Wave matches or surpasses competing self-supervised methods in both Challenge Score and AUC, and continues to scale more favorably as additional dat… view at source ↗
Figure 3
Figure 3. Figure 3: Results using different sequence lengths with SL-S4Wave and baseline model in the two dataset. [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Broad transferability of SL-S4Wave. We further show that SL-S4Wave provides a strong pretrained model for adaptation to a wide variety of datasets related to various types of arrhythmia. This figure shows the 5 different arrhythmia results in MIMIC dataset. In all cases, SL-S4Wave tends to at least match or exceed the performance (as measured by Challenge Score) of the best method trained from scratch. Det… view at source ↗
Figure 5
Figure 5. Figure 5: The results of SL-S4Wave on the VTaC dataset with different numbers of blocks. [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visualization of the initial weights for different kernel sizes [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: An example of using the filter, the filter eliminates high-frequency noise present in the data. [PITH_FULL_IMAGE:figures/full_fig_p033_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Example true vs. false VT alarms. Each plot shows data in the 10-second interval immediately [PITH_FULL_IMAGE:figures/full_fig_p033_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: An example of abnormal precursor prior to VT alarm onset. Abnormal waveform pattern appears [PITH_FULL_IMAGE:figures/full_fig_p034_9.png] view at source ↗
read the original abstract

Modeling long-sequence medical time series data, such as electrocardiograms (ECG), poses significant challenges due to high sampling rates, multichannel signal complexity, inherent noise, and limited labeled data. While recent self-supervised learning (SSL) methods, based on various encoder architectures such as convolutional neural networks, have been proposed to learn representations from unlabeled data, they often fall short in capturing long-range dependencies and noise-invariant features. Structured state space models (S4) excel at long-sequence modeling, but existing S4 architectures fail to capture the unique characteristics of multichannel physiological waveforms. In this work, we propose SL-S4Wave, a self-supervised learning framework that combines contrastive learning with a tailored encoder built on structured state space models. The encoder incorporates multi-layer global convolution using multiscale subkernels, enabling the capture of both fine-grained local patterns and long-range temporal dependencies in noisy, high-resolution multichannel waveforms. Extensive experiments on real-world datasets demonstrate that SL-S4Wave (1) consistently outperforms state-of-the-art supervised and self-supervised baselines in a challenging arrhythmia detection task, (2) achieves high performance with significantly fewer labeled examples, showcasing strong label efficiency, and (3) maintains robust performance on long waveform segments, highlighting its capacity to model complex temporal dynamics in long sequences that most existing approaches fail to efficiently model, and (4) transfers effectively to unseen arrhythmia types, underscoring its robust cross-domain generalization. We additionally evaluate SL-S4Wave on multiple EEG tasks, achieving superior performance over strong baselines, demonstrating generalizability of our approach beyond cardiac waveforms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes SL-S4Wave, a self-supervised contrastive learning framework whose encoder augments structured state space models (S4) with multi-layer global convolution using multiscale subkernels. The central claims are that this architecture captures both fine-grained local patterns and long-range dependencies in noisy, high-resolution multichannel physiological waveforms (ECG, EEG) better than prior S4 or CNN-based SSL methods, yielding (1) superior arrhythmia detection performance, (2) strong label efficiency, (3) robustness on long waveform segments, (4) effective transfer to unseen arrhythmia types, and (5) superior results on multiple EEG tasks.

Significance. If the empirical results can be verified, the work would demonstrate a concrete architectural improvement to S4 models that addresses their documented shortcomings on multichannel physiological signals, offering a path toward more label-efficient and long-range-capable representations for clinical time-series data.

major comments (1)
  1. [Abstract] Abstract: the manuscript asserts 'extensive experiments' that demonstrate consistent outperformance, label efficiency, long-sequence robustness, cross-arrhythmia transfer, and EEG superiority, yet supplies no datasets, metrics, baseline implementations, ablation details, error bars, or statistical tests. Without these elements the central empirical claims cannot be evaluated and therefore cannot support the stated conclusions.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. Below we address the major comment point by point.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the manuscript asserts 'extensive experiments' that demonstrate consistent outperformance, label efficiency, long-sequence robustness, cross-arrhythmia transfer, and EEG superiority, yet supplies no datasets, metrics, baseline implementations, ablation details, error bars, or statistical tests. Without these elements the central empirical claims cannot be evaluated and therefore cannot support the stated conclusions.

    Authors: The abstract is a high-level summary constrained by length limits and therefore omits specific implementation details. The full manuscript (Section 4 and appendices) supplies all requested elements: the exact datasets (with preprocessing and splits), evaluation metrics, baseline re-implementations (including code references where available), ablation studies on the multiscale subkernels and contrastive objectives, error bars from multiple random seeds, and statistical significance tests (paired t-tests or Wilcoxon). These results directly support the five claims listed in the abstract. We are happy to add one or two concrete dataset/metric examples to the abstract in a revision if the editor permits a modest length increase. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes an empirical SSL framework (SL-S4Wave) combining contrastive learning with a modified S4 encoder and reports performance gains on arrhythmia detection and EEG tasks. No equations, first-principles derivations, or predictions appear in the abstract or described claims. All results are framed as outcomes of experiments on real-world datasets rather than reductions to fitted inputs, self-citations, or self-definitional steps. The architecture description (multi-layer global convolution with multiscale subkernels) is presented as a design choice, not a derived necessity that collapses to prior results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Insufficient information in the abstract to identify specific free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5839 in / 1220 out tokens · 30373 ms · 2026-06-26T18:33:21.719268+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

25 extracted references · 3 canonical work pages

  1. [1]

    The aasm manual for the scoring of sleep and associated events.Rules, Terminology and Technical Specifications, Darien, Illinois, American Academy of Sleep Medicine, 176(2012):7,

    Richard B Berry, Rita Brooks, Charlene E Gamaldo, Susan M Harding, Carole Marcus, Bradley V Vaughn, et al. The aasm manual for the scoring of sleep and associated events.Rules, Terminology and Technical Specifications, Darien, Illinois, American Academy of Sleep Medicine, 176(2012):7,

  2. [2]

    doi: 10.5555/3454287.3455210

    Curran Associates, Inc. doi: 10.5555/3454287.3455210. URLhttps://papers.nips.cc/paper_files/paper/2019/file/ 2afc4dfb14e55c6face649a1d0c1025b-Paper.pdf. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, pp. 1597–1607. PMLR,

  3. [3]

    The PhysioNet/Computing in Cardiology challenge 2015: reducing false arrhythmia alarms in the ICU

    Gari D Clifford, Ikaro Silva, Benjamin Moody, Qiao Li, Danesh Kella, Abdullah Shahin, Tristan Kooistra, Diane Perry, and Roger G Mark. The PhysioNet/Computing in Cardiology challenge 2015: reducing false arrhythmia alarms in the ICU. In2015 Computing in Cardiology Conference (CinC), pp. 273–276. IEEE,

  4. [4]

    ISBN 9781450383325

    Association for Computing Machinery. ISBN 9781450383325. doi: 10.1145/3447548.3467330. URLhttps: //doi.org/10.1145/3447548.3467330. Barbara J Drew, Patricia Harris, Jessica K Zègre-Hemsey, Tina Mammone, Daniel Schindler, Rebeca Salas- Boni, Yong Bai, Adelita Tinoco, Quan Ding, and Xiao Hu. Insights into the problem of alarm fatigue with physiologic monito...

  5. [5]

    Efficiently modeling long sequences with structured state spaces.arXiv preprint arXiv:2111.00396,

    14 Published in Transactions on Machine Learning Research (05/2026) Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces.arXiv preprint arXiv:2111.00396,

  6. [7]

    Haigen Hu, Xiaoyuan Wang, Yan Zhang, Qi Chen, and Qiu Guan

    URLhttp://arxiv.org/abs/1512.03385. Haigen Hu, Xiaoyuan Wang, Yan Zhang, Qi Chen, and Qiu Guan. A comprehensive survey on contrastive learning.Neurocomputing, 610:128645,

  7. [8]

    Foundation model trained on over ten million ECG recordings with expert-level performance across diagnostics

    doi: 10.1056/AIoa2401033. Foundation model trained on over ten million ECG recordings with expert-level performance across diagnostics. Yuhong Li, Tianle Cai, Yi Zhang, Deming Chen, and Debadeepta Dey. What makes convolutional models great on long sequence modeling?ICLR,

  8. [9]

    15 Published in Transactions on Machine Learning Research (05/2026) Christopher McMaster, David FL Liew, and Douglas EV Pires

    URLhttps://openreview.net/forum?id= msJgEkjwh5. 15 Published in Transactions on Machine Learning Research (05/2026) Christopher McMaster, David FL Liew, and Douglas EV Pires. Adapting pretrained language models for solving tabular prediction problems in the electronic health record.arXiv preprint arXiv:2303.14920,

  9. [11]

    Modeling disease progression in retinal octs with longi- tudinal self-supervised learning

    Antoine Rivail, Ursula Schmidt-Erfurth, Wolf-Dieter Vogl, Sebastian M Waldstein, Sophie Riedl, Christoph Grechenig, Zhichao Wu, and Hrvoje Bogunovic. Modeling disease progression in retinal octs with longi- tudinal self-supervised learning. InPredictive Intelligence in Medicine: Second International Workshop, PRIME 2019, Held in Conjunction with MICCAI 20...

  10. [12]

    Simplified state space layers for sequence modeling.arXiv preprint arXiv:2208.04933,

    Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling.arXiv preprint arXiv:2208.04933,

  11. [13]

    Ts2vec: Towards universal representation of time series

    16 Published in Transactions on Machine Learning Research (05/2026) Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, and Bixiong Xu. Ts2vec: Towards universal representation of time series. InProceedings of the AAAI conference on artificial intelligence, volume 36, pp. 8980–8987,

  12. [14]

    17 Published in Transactions on Machine Learning Research (05/2026) A Structured State-Space Model The state-space model is a classic model in control theory, and it represents the operational state of a system using first-order differential equations (ODE). A continuous state-space model can be defined in the following form: x′(t) =Ax(t) +Bu(t), y(t) =Cx...

  13. [15]

    Performance of SL-S4Wave on Longer Sequences

    Table 6 reports the downstream classification performance of SL-S4Wave and three alternative encoder architectures—FCN, ResNet18, and Transformer—trained with and without self-supervised pre-training on twodatasets(VTaCandMIMICII).TheTransformerusedinthisexperimentisthestandardimplementation of PyTorch. Its parameters are: d model is 512, head is 8, feed ...

  14. [16]

    Therefore, in the ablation experiments, we set different values ofdto investigate the effect of kernel size on model performance

    We recognized that the kernel size setting might have an impact on our model’s results. Therefore, in the ablation experiments, we set different values ofdto investigate the effect of kernel size on model performance. Table. C and Table. C show the experimental results on MIMIC II dataset and VTaC dataset. Overall, smaller values ofdlead to better perform...

  15. [17]

    We reported the average of 5 experiments

    We use the data from 10 seconds before the 26 Published in Transactions on Machine Learning Research (05/2026) d TPR TNR Score F1 PPV AUC Val Score Params 2 94.4050.47 65.21 77.59 65.75 86.05 71.5218m 16 93.62 44.95 61.52 75.53 63.23 83.25 70.55 24m 6494.6141.94 61.78 75.07 62.09 81.95 70.44 36m 128 93.97 41.86 60.71 74.73 61.90 80.86 70.16 48m 512 91.99 ...

  16. [18]

    We adopt the default settings provided by the TS2vec implementation for other settings

    We train for 40 epochs to slightly overfit it. We adopt the default settings provided by the TS2vec implementation for other settings. All experiments were conducted on an NVIDIA-V100 GPU. 2https://github.com/danikiyasseh/CLOCS 27 Published in Transactions on Machine Learning Research (05/2026) ECG-JEPA: We use learning rate of 1e-4, weight decay of 1e-4,...

  17. [19]

    Since the author did not provide pre-trained weights, we used our own pre-training dataset for pre-training

    Following the original setting, we train for 4000 iterations and apply early stopping after 1000 iterations. Since the author did not provide pre-trained weights, we used our own pre-training dataset for pre-training. During the pre-training process, we used a learning rate of 1e-4, a dropout rate of 0.1, and a weight decay of 1.0e-2. We trained for 100,0...

  18. [20]

    All experiments were conducted on an NVIDIA-A100 40G GPU

    Additionally, since our model includes CVP, Pleth, and ABP channels, we placed these two channels in the positions of V5 and V6. All experiments were conducted on an NVIDIA-A100 40G GPU. E Data Preprocessing For all datasets, each waveform recording contains six minutes of multi-channel physiological waveforms, with the arrhythmia alarm onset at the end o...

  19. [21]

    Regarding the fine-tuning task, an 8:1:1 split was applied to the VTaC labeled data for training, validation, and testing respectively

    In constructing the pre-train dataset, we extracted unlabeled waveform data corresponding to 17,640 VT alarm events from 1,949 patient records without any expert annotations inVTaCdataset. Regarding the fine-tuning task, an 8:1:1 split was applied to the VTaC labeled data for training, validation, and testing respectively. For the MIMIC II-VT dataset, an ...

  20. [22]

    ASY, EBR, ETC, VTA and VFB represent Asystole, Extreme Bradycardia, Extreme Tachycardia, Ventricular Tachycardia, and Ventricular Flutter/Fibrillation

    Alarm types F ALSE TRUE ASY 100 20 EBR 45 45 ETC 8 131 VTA 253 90 VFB 52 6 Total 458 292 Table 15: Detailed distribution of various alarms in the Challenge 2015 dataset. ASY, EBR, ETC, VTA and VFB represent Asystole, Extreme Bradycardia, Extreme Tachycardia, Ventricular Tachycardia, and Ventricular Flutter/Fibrillation. Alarm Types True Alarms False Alarm...

  21. [23]

    For the three datasets, we utilized min-max normalization to process the data

    Due to the different value ranges of different channels, for instance, ECG channels often fall between [-2,2], while ABP channels are between [40,180], we need to process the data to enable deep learning models to perform gradient descent effectively. For the three datasets, we utilized min-max normalization to process the data. Specifically, for each rec...

  22. [24]

    The dataset is annotated according to the American Academy of Sleep Medicine (AASM) standard (Berry et al., 2012), with five sleep stages: Wake,NREM1 (N1),NREM (N2),NREM (N3), andREM. Each EEG segment corresponds to a 30-second epoch.ISRUC_S3is a smaller dataset comprising recordings from 10 subjects, also sampled at 200 Hz with six channels, totaling 8,5...

  23. [25]

    no stress

    It contains recordings from 36 subjects under two distinct cognitive conditions:restingandactive engagementin mental arithmetic. EEG data labeled as “no stress” correspond to resting periods prior to the task, while “stress” labels are assigned to recordings during task performance. The signals were acquired using 20 electrodes placed according to the int...

  24. [26]

    a notch filter to eliminate power line interference. For PPG signal, we utilize a high-pass filter with a stopband frequency of 0.3 Hz and a passband frequency of 0.5 Hz, along with a low-pass filter with a passband frequency of 5 Hz and a stopband frequency of 8 Hz. Figure 7 shows an example of using a filter to eliminate high-frequency noise. H Example ...

  25. [27]

    Figure (b) shows an example false VT alarm – the event corresponds to an episode of atrial fibrillation with rate-related aberration instead of a ventricular tachycardia. H.2 Premonition Example: Abnormal Precursor Prior to Alarm Onset We illustrate in Figure 9 an example where an abnormal waveform pattern appears before the 10-second window immediately p...