pith. sign in

arxiv: 2606.09605 · v1 · pith:2SNT2E6Knew · submitted 2026-06-08 · 💻 cs.AI

Next-Token Prediction Learns Generalisable Representations of Sleep Physiology

Pith reviewed 2026-06-27 16:40 UTC · model grok-4.3

classification 💻 cs.AI
keywords next-token predictionfoundation modelsleep stagingphysiological signalsself-supervised learningpolysomnographyatrial fibrillationmulti-modal representations
0
0 comments X

The pith

Next-token prediction on tokenized multi-modal sleep signals produces embeddings that match supervised baselines with 100 times less labeled data and generalize to daytime atrial fibrillation detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains Hypnos, a foundation model, by tokenizing eight physiological modalities from over 20,000 overnight recordings and using next-token prediction to learn joint representations. This approach is presented as a scalable alternative to masked reconstruction or contrastive learning for signals whose semantic invariances remain poorly defined. On sleep stage classification the resulting embeddings reach the accuracy of strong supervised models while using far less labeled data. The same embeddings also outperform a dedicated ECG foundation model when applied to daytime recordings for atrial fibrillation detection. The work therefore claims that next-token prediction alone suffices to extract generalizable features from stochastic multi-modal physiological data.

Core claim

Hypnos tokenizes each of eight sensing modalities with residual vector quantization, then trains a large auto-regressive RQ-Transformer to predict the next token across all modalities in parallel. After pretraining, the model produces embeddings from any supported subset of modalities that match supervised sleep-stage baselines on held-out test sets while using 100 times less labeled data and that surpass a dedicated ECG foundation model at detecting atrial fibrillation in daytime physiology.

What carries the argument

An auto-regressive RQ-Transformer trained with next-token prediction on parallel streams of residual-vector-quantized tokens drawn from multiple physiological modalities.

If this is right

  • Sleep stage classification matches strong supervised baselines on held-out sets while using 100 times less labeled data.
  • The same embeddings surpass a dedicated ECG foundation model at atrial fibrillation detection from daytime recordings.
  • Embeddings can be generated from continuous streams of any supported subset of the eight modalities.
  • Next-token prediction serves as a scalable self-supervised objective for multi-modal physiological signals where positive-pair definitions are difficult to specify.
  • The approach outperforms existing foundation models across the reported benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may transfer to other stochastic multi-modal medical signals where explicit invariance definitions are unavailable.
  • Lower labeled-data requirements could accelerate model development for rare sleep or cardiac conditions.
  • Joint next-token prediction across modalities may encode cross-signal relationships that single-modality pretraining misses.
  • Further downstream tasks such as sleep-apnea event detection would provide additional tests of the claimed generality.

Load-bearing premise

That next-token prediction on tokenized physiological streams will automatically learn the semantic invariances required for downstream generalization without any explicit positive-pair or reconstruction targets.

What would settle it

On a held-out sleep staging test set, embeddings from the next-token model fail to reach the accuracy of a supervised baseline trained on the full labeled set when only one percent of the labels are supplied, or the model underperforms the dedicated ECG foundation model on atrial fibrillation detection from daytime recordings.

Figures

Figures reproduced from arXiv: 2606.09605 by Jonathan F. Carter, Lionel Tarassenko.

Figure 1
Figure 1. Figure 1: Overview. Hypnos is a large auto-regressive RQ-transformer trained via multi-modal next￾token prediction on tokenized streams of physiological sensor data. During pre-training, cross-modal attention is restricted to randomly sampled sub-groups, improving test-time generalisation to subsets of supported modalities. After pre-training, Hypnos can be used to generate high-quality embeddings for a diverse rang… view at source ↗
Figure 2
Figure 2. Figure 2: For stochastic signals such as an ECG (left), the distribution over the future may be [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Tokenizer training. Each sig￾nal Xi is encoded with RVQ into discrete residual tokens V i . Encoder and decoder are trained jointly to reconstruct Xi . For each modality, we train tokenizers to transform the continuous raw signals into discrete tokens. We do this using an encoder-decoder architecture with a residual vector quantization (RVQ) layer, previ￾ously used by several foundation models, including f… view at source ↗
Figure 4
Figure 4. Figure 4: Hypnos training. (a) For each time t, the K discrete residual tokens (illustrated with K = 4) from modality i are combined to form an embedding mi t . (b) A Transformer backbone mixes information to produce embeddings z i t for each modality, e.g. i ∈ {A, B, C, D}. (c) For all i, t, the Depth Transformer auto-regressively predicts the next residual token V i t+1,k conditioned on z i t . 3.3 Hypnos After to… view at source ↗
Figure 5
Figure 5. Figure 5: Example cross-modal attention matrices (M = 4). Dur￾ing training, attention is restricted to random sub-groups. We split modalities into groups by sampling from a Chinese Restaurant Process [2] with concentration parameter α. Modal￾ities are assigned sequentially: modality i + 1 joins an existing group g of size ng with probability ng/(i + α) and starts a new group with probability α/(i + α). The resulting… view at source ↗
Figure 6
Figure 6. Figure 6: Few-shot sleep stage classification. We train MLP probes on each foundation model and re-train supervised baselines using varying fractions of in-domain data. Using as little as 1% of the probe-training data, Hypnos matches U-Sleep trained on the full dataset on held-out MrOS. 4.4 Transfer to external ECG benchmarks [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Autoregressive generation of physiological signals. Hypnos can be used to jointly generate physiological signals for any subset of supported modalities. Here we see that conditioned on 10 s of real context (blue), Hypnos generates plausible signals with cross-modal consistency. For example, we can observe respiration-induced amplitude modulation of R-peaks in the ECG. 4.6 Modality-masking We ablate our gro… view at source ↗
Figure 8
Figure 8. Figure 8: Scaling model size improves the performance of Hypnos. Next-token perplexity and downstream metrics all improve with model scale. Sleep stage classification, apnoea detection and arousal detection performance are reported using a linear probe on the SHHS validation set. 5 Limitations and Future Work Improving sensor generalisation We demonstrated that our method enables held-out generalisation to subsets o… view at source ↗
Figure 9
Figure 9. Figure 9: Effect of input and output residual depth on downstream performance. Linear probe metrics for unimodal EEG models on the SHHS validation set varying Kin and Kout. Performance slightly improves when increasing the number of input residual tokens but worsens when increasing the number of output residual tokens. Tokenization length In our main experiments, we designed our tokenizers to produce tokens at a rat… view at source ↗
Figure 10
Figure 10. Figure 10: Effect of token duration on downstream performance. (left) Reconstruction quality decreases as token duration is varied from 0.25 s to 5 s, i.e. the compression rate increases. However, performance is worst at high token rates (0.25 s) and saturates or regresses beyond 1 s. We adopt a 1 s token duration in all other experiments. Adversarial losses Défossez et al. [16] recently observed that removing recon… view at source ↗
Figure 11
Figure 11. Figure 11: Scaling unimodal EEG models from Tiny to Large. Next-token perplexity and downstream metrics continue to improve with model scale. Sleep stage classification, apnoea detection and arousal detection performance are reported using a linear probe on the SHHS validation set. B.3 Scaling context length A key motivation for next-token prediction is that it naturally scales to longer context lengths. To quantify… view at source ↗
Figure 12
Figure 12. Figure 12: Effect of context length on single-channel EEG models. Validation perplexity decreases and downstream probing performance improves as the training context length is increased from 128 to 4096 tokens. Sleep staging and arousal detection saturate at around 1024–2048 tokens, while age regression, CVD risk and moderate OSA detection continue to improve at the longest context lengths. (a) Validation perplexity… view at source ↗
Figure 13
Figure 13. Figure 13: Effect of context length on ECG-only models. As with EEG, perplexity and downstream metrics improve with longer context. The largest relative gains are again on summary tasks such as moderate OSA detection and CVD risk. B.4 Modality-masking ablation [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗
read the original abstract

Foundation models offer a promising route to compress multi-modal physiological signals into compact representations of human health, with broad applications across sleep medicine, cardiology, neurology and other healthcare domains. Existing models have typically been trained with masked-reconstruction or contrastive objectives. However, masked reconstruction may be poorly suited to the stochastic nature of these signals, while contrastive approaches rely on positive-pair definitions despite the semantic invariances of physiological signals being poorly understood. In this work, we show that next-token prediction is a simple and scalable alternative. We develop Hypnos, a multi-modal sleep foundation model trained using eight different sensing modalities (e.g. EEG, ECG, respiratory signals) drawn from over 20,000 overnight polysomnography recordings. We tokenize each modality into streams of discrete tokens using residual vector quantization, then train a large auto-regressive RQ-Transformer to jointly predict the next token across all modalities in parallel. After training, Hypnos can be applied to continuous streams of sensor data from any subset of supported modalities, generating embeddings for downstream tasks. Across a range of benchmarks, Hypnos significantly outperforms existing foundation models. In sleep stage classification, we match the performance of strong supervised baselines on held-out test sets whilst using \(100\times\) less labelled data. Hypnos even generalises to daytime physiology, surpassing a dedicated ECG foundation model at detecting atrial fibrillation. Our results demonstrate that next-token prediction is a strong self-supervised objective for representation learning from multi-modal physiological signals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Hypnos, a multi-modal foundation model for sleep physiology trained via next-token prediction. It tokenizes eight sensing modalities (EEG, ECG, respiratory signals, etc.) from over 20,000 overnight PSG recordings using residual vector quantization, then trains a large auto-regressive RQ-Transformer to jointly predict the next token across modalities. The central claims are that this objective yields representations that (i) significantly outperform existing foundation models, (ii) match strong supervised baselines on held-out sleep-stage classification while using 100× less labeled data, and (iii) generalize to daytime single-channel ECG, surpassing a dedicated ECG foundation model on atrial-fibrillation detection.

Significance. If the empirical claims hold after proper controls, the result would be significant: it supplies concrete evidence that a simple autoregressive objective can induce useful invariances in stochastic physiological time series without requiring contrastive positive-pair definitions or masked reconstruction. The scale (20 k+ multi-modal recordings) and the reported data-efficiency gain in sleep staging would strengthen the case for next-token prediction as a scalable pre-training strategy in healthcare signal modeling.

major comments (1)
  1. [Abstract] Abstract (generalization claim): The assertion that Hypnos 'generalises to daytime physiology, surpassing a dedicated ECG foundation model at detecting atrial fibrillation' is load-bearing for the cross-domain claim. The manuscript provides no quantification of domain shift (circadian, activity, or recording-context differences), no statement on whether the daytime evaluation used zero-shot embeddings or fine-tuning, and no comparison of the dedicated ECG model's training data volume or diversity. Without these controls it is impossible to attribute outperformance to the next-token objective or multi-modal pre-training rather than architecture or data scale.
minor comments (1)
  1. [Abstract] Abstract: No experimental details (data splits, metrics, error bars, or whether results are from linear probing vs. fine-tuning) are supplied, making it difficult for readers to assess the strength of the reported gains at first reading.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful review and for identifying areas where the generalization claim requires additional support. We address the single major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract (generalization claim): The assertion that Hypnos 'generalises to daytime physiology, surpassing a dedicated ECG foundation model at detecting atrial fibrillation' is load-bearing for the cross-domain claim. The manuscript provides no quantification of domain shift (circadian, activity, or recording-context differences), no statement on whether the daytime evaluation used zero-shot embeddings or fine-tuning, and no comparison of the dedicated ECG model's training data volume or diversity. Without these controls it is impossible to attribute outperformance to the next-token objective or multi-modal pre-training rather than architecture or data scale.

    Authors: We agree that the abstract is too concise to convey these details and that the manuscript as submitted lacks explicit quantification of domain shift and direct comparisons of the baseline model's data. In revision we will (i) expand the abstract to state that daytime AF detection uses fine-tuned embeddings on a modest amount of labeled daytime ECG, (ii) add a dedicated paragraph or short subsection that reports basic distributional statistics (e.g., heart-rate variability, signal amplitude) between the overnight PSG and daytime recordings and discusses circadian/activity differences, and (iii) include a brief comparison of the published training scale and diversity of the dedicated ECG foundation model. These additions will make the attribution to the next-token objective clearer while remaining within the scope of the existing experiments. revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation is self-contained

full rationale

The paper trains a multi-modal RQ-Transformer with a standard next-token prediction objective on tokenized overnight PSG recordings (eight modalities) and then extracts embeddings for separate downstream evaluations on held-out sleep staging and daytime ECG AF detection tasks. No equations or claims reduce the objective or results to the evaluation metrics by construction, no fitted parameters are relabeled as predictions, and no load-bearing self-citations or uniqueness theorems are invoked. The pretraining objective and downstream tasks remain independent, making the reported generalization a genuine empirical claim rather than a definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on model architecture specifics, tokenization parameters, or training hyperparameters, so no free parameters or axioms can be identified.

pith-pipeline@v0.9.1-grok · 5791 in / 969 out tokens · 24395 ms · 2026-06-27T16:40:39.545108+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

70 extracted references · 31 canonical work pages · 1 internal anchor

  1. [1]

    Miller, Saba Emrani, Udhyakumar Nallasamy, and Ian Shapiro

    Salar Abbaspourazad, Oussama Elachqar, Andrew C. Miller, Saba Emrani, Udhyakumar Nallasamy, and Ian Shapiro. Large-scale Training of Foundation Models for Wearable Biosignals. InThe Twelfth International Conference on Learning Representations, March 2024. doi: 10.48550/arXiv.2312.05409. 1

  2. [2]

    David J. Aldous. Exchangeability and related topics. In David J. Aldous, Illdar A. Ibragimov, Jean Jacod, and P. L. Hennequin, editors,École d’Été de Probabilités de Saint-Flour XIII — 1983, pages 1–198, Berlin, Heidelberg, 1985. Springer. ISBN 978-3-540-39316-0. doi: 10.1007/BFb0099421. 6

  3. [3]

    Behavioral Timescale Synaptic Plasticity: A Burst in the Field of Learning and Memory

    Thomas Andrillon, Yuval Nir, Richard J. Staba, Fabio Ferrarelli, Chiara Cirelli, Giulio Tononi, and Itzhak Fried. Sleep Spindles in Humans: Insights from Intracranial EEG and Unit Recordings.The Journal of Neuroscience, 31(49):17821–17834, December 2011. ISSN 0270-6474. doi: 10.1523/JNEUROSCI. 2604-11.2011. 18

  4. [4]

    Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023. 17

  5. [5]

    Uncovering the structure of clinical EEG signals with self-supervised learning.Journal of Neural Engineering, 18(4):046020, 2021

    Hubert Banville, Omar Chehab, Aapo Hyvärinen, Denis-Alexander Engemann, and Alexandre Gram- fort. Uncovering the structure of clinical EEG signals with self-supervised learning.Journal of Neural Engineering, 18(4):046020, 2021. 1

  6. [6]

    Peters, and Arman Cohan

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The Long-Document Transformer, December 2020. 5

  7. [7]

    Audiolm: A language modeling approach to audio generation.IEEE/ACM transactions on audio, speech, and language processing, 31: 2523–2533, 2023

    Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, and Marco Tagliasacchi. Audiolm: A language modeling approach to audio generation.IEEE/ACM transactions on audio, speech, and language processing, 31: 2523–2533, 2023. 1, 2

  8. [8]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  9. [9]

    Emerging Properties in Self-Supervised Vision Transformers, May 2021

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging Properties in Self-Supervised Vision Transformers, May 2021. 17

  10. [10]

    Carter and Lionel Tarassenko

    Jonathan F. Carter and Lionel Tarassenko. Wav2sleep: A Unified Multi-Modal Approach to Sleep Stage Classification from Physiological Signals. InProceedings of the 4th Machine Learning for Health Symposium, pages 186–202. PMLR, February 2025. 6

  11. [11]

    Lutsey, Sogol Javaheri, Carmela Alcántara, Chandra L

    Xiaoli Chen, Rui Wang, Phyllis Zee, Pamela L. Lutsey, Sogol Javaheri, Carmela Alcántara, Chandra L. Jackson, Michelle A. Williams, and Susan Redline. Racial/Ethnic Differences in Sleep Disturbances: The Multi-Ethnic Study of Atherosclerosis (MESA).Sleep, 38(6):877–888, June 2015. ISSN 1550-9109. doi: 10.5665/sleep.4732. 3

  12. [12]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    Xinlei Chen and Kaiming He. Exploring Simple Siamese Representation Learning. In2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15745–15753, Nashville, TN, USA, June 2021. IEEE. ISBN 978-1-6654-4509-2. doi: 10.1109/CVPR46437.2021.01549. 3

  13. [13]

    Simple and Controllable Music Generation.Advances in Neural Information Processing Systems, 36:47704–47720, December 2023

    Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Defossez. Simple and Controllable Music Generation.Advances in Neural Information Processing Systems, 36:47704–47720, December 2023. 2

  14. [14]

    Imant Daunhawer, Alice Bizeul, Emanuele Palumbo, Alexander Marx, and Julia E. V ogt. Identifiability Results for Multimodal Contrastive Learning, March 2023. 3

  15. [15]

    Kyle, and Lionel Tarassenko

    Shaun Davidson, Rachel Sharman, Simon D. Kyle, and Lionel Tarassenko. Is it time to revisit the scoring of slow wave (N3) sleep?Sleep, 48(10), October 2025. ISSN 0161-8105. doi: 10.1093/sleep/zsaf063. 24

  16. [16]

    Moshi: A speech-text foundation model for real-time dialogue, October 2024

    Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: A speech-text foundation model for real-time dialogue, October 2024. 1, 2, 4, 5, 15, 19 11

  17. [17]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, October 2020. 5

  18. [18]

    A foundational transformer leveraging full night, multichannel sleep study data accurately classifies sleep stages.Sleep, 48(8):zsaf061, August 2025

    Benjamin Fox, Joy Jiang, Sajila Wickramaratne, Patricia Kovatch, Mayte Suarez-Farinas, Neomi A Shah, Ankit Parekh, and Girish N Nadkarni. A foundational transformer leveraging full night, multichannel sleep study data accurately classifies sleep stages.Sleep, 48(8):zsaf061, August 2025. ISSN 0161-8105. doi: 10.1093/sleep/zsaf061. 1, 3

  19. [19]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. https://arxiv.org/abs/2403.05530v5, March 2024. 1

  20. [20]

    During, and Valentin Thorey

    Antoine Guillot, Fabien Sauvet, Emmanuel H. During, and Valentin Thorey. Dreem Open Datasets: Multi-Scored Sleep Datasets to Compare Human and Automated Sleep Staging.IEEE Transactions on Neural Systems and Rehabilitation Engineering, 28(9):1955–1965, September 2020. ISSN 1558-0210. doi: 10.1109/TNSRE.2020.3011181. 4

  21. [21]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. InAdvances in Neural Information Processing Systems (NeurIPS 2020). arXiv, December 2020. doi: 10.48550/arXiv. 2006.11239. 3

  22. [22]

    C. Iber. The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology, and Technical Specification. 2007. 3

  23. [23]

    The Brain’s Bitter Lesson: Scaling Speech Decoding With Self-Supervised Learning

    Dulhan Jayalath, Gilad Landau, Brendan Shillingford, Mark Woolrich, and Oiwi Parker Jones. The Brain’s Bitter Lesson: Scaling Speech Decoding With Self-Supervised Learning. InProceedings of the 42nd International Conference on Machine Learning, June 2025. 3

  24. [24]

    Large brain model for learning generic representations with tremendous eeg data in bci.arXiv preprint arXiv:2405.18765, 2024

    Wei-Bang Jiang, Li-Ming Zhao, and Bao-Liang Lu. Large Brain Model for Learning Generic Representa- tions with Tremendous EEG Data in BCI. InInternational Conference on Learning Representations (ICLR 2024), May 2024. doi: 10.48550/arXiv.2405.18765. 2

  25. [25]

    NeuroLM: A Universal Multi-task Foun- dation Model for Bridging the Gap between Language and EEG Signals

    Weibang Jiang, Yansen Wang, Bao-liang Lu, and Dongsheng Li. NeuroLM: A Universal Multi-task Foun- dation Model for Bridging the Gap between Language and EEG Signals. InThe Thirteenth International Conference on Learning Representations, October 2024. 2

  26. [26]

    Biing-Hwang Juang and A. Gray. Multiple stage vector quantization for speech coding. InICASSP ’82. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 7, pages 597–600, May 1982. doi: 10.1109/ICASSP.1982.1171604. 4

  27. [27]

    Dani Kiyasseh, Tingting Zhu, and David A. Clifton. CLOCS: Contrastive Learning of Cardiac Signals Across Space, Time, and Patients, May 2021. 3

  28. [28]

    Autoregressive Image Generation using Residual Quantization

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive Image Generation using Residual Quantization. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022). arXiv, March 2022. doi: 10.48550/arXiv.2203.01941. 2, 5

  29. [29]

    Splaingard, Yungui Huang, Yuejie Chi, and Simon L

    Harlin Lee, Boyue Li, Shelly DeForte, Mark L. Splaingard, Yungui Huang, Yuejie Chi, and Simon L. Linwood. A large collection of real-world pediatric sleep studies.Scientific Data, 9(1):421, July 2022. ISSN 2052-4463. doi: 10.1038/s41597-022-01545-6. 3

  30. [30]

    Simon A. Lee, Cyrus Tanade, Hao Zhou, Juhyeon Lee, Megha Thukral, Minji Han, Rachel Choi, Md Saz- zad Hissain Khan, Baiying Lu, Migyeong Gwak, Mehrab Bin Morshed, Viswam Nathan, Md Mahbubur Rahman, Li Zhu, Subramaniam Venkatraman, and Sharanya Arcot Desai. HiMAE: Hierarchical Masked Autoencoders Discover Resolution-Specific Structure in Wearable Time Seri...

  31. [31]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In7th International Confer- ence on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. 5

  32. [32]

    BenchECG and xECG: A benchmark and baseline for ECG foundation models, September 2025

    Riccardo Lunelli, Angus Nicolson, Samuel Martin Pröll, Sebastian Johannes Reinstadler, Axel Bauer, and Clemens Dlaska. BenchECG and xECG: A benchmark and baseline for ECG foundation models, September 2025. 8, 24 12

  33. [33]

    Marcus, Reneé H

    Carole L. Marcus, Reneé H. Moore, Carol L. Rosen, Bruno Giordani, Susan L. Garetz, H. Gerry Taylor, Ron B. Mitchell, Raouf Amin, Eliot S. Katz, Raanan Arens, Shalini Paruthi, Hiren Muzumdar, David Gozal, Nina Hattiangadi Thomas, Janice Ware, Dean Beebe, Karen Snyder, Lisa Elden, Robert C. Sprecher, Paul Willging, Dwight Jones, John P. Bent, Timothy Hoban,...

  34. [34]

    Hoos, and James J

    Julieta Martinez, Holger H. Hoos, and James J. Little. Stacked Quantizers for Compositional Vector Compression, November 2014. 4

  35. [35]

    ECG-FM: An Open Electrocardiogram Foundation Model, May 2025

    Kaden McKeen, Sameer Masood, Augustin Toma, Barry Rubin, and Bo Wang. ECG-FM: An Open Electrocardiogram Foundation Model, May 2025. 2

  36. [36]

    McSharry, G.D

    P.E. McSharry, G.D. Clifford, L. Tarassenko, and L.A. Smith. A dynamical model for generating synthetic electrocardiogram signals.IEEE Transactions on Biomedical Engineering, 50(3):289–294, March 2003. ISSN 1558-2531. doi: 10.1109/TBME.2003.808805. 2

  37. [37]

    Sejnowski

    Lyle Muller, Frédéric Chavane, John Reynolds, and Terrence J. Sejnowski. Cortical travelling waves: Mechanisms and computational principles.Nature Reviews Neuroscience, 19(5):255–268, May 2018. ISSN 1471-0048. doi: 10.1038/nrn.2018.20. 10

  38. [38]

    Scaling Wearable Foundation Models

    Girish Narayanswamy, Xin Liu, Kumar Ayush, Yuzhe Yang, Xuhai Xu, Shun Liao, Jake Garrison, Shyam Tailor, Jake Sunshine, Yun Liu, Tim Althoff, Shrikanth Narayanan, Pushmeet Kohli, Jiening Zhan, Mark Malhotra, Shwetak Patel, Samy Abdel-Ghaffar, and Daniel McDuff. Scaling Wearable Foundation Models. InThe Thirteenth International Conference on Learning Repre...

  39. [39]

    U- Sleep: Resilient high-frequency sleep staging.npj Digital Medicine, 4(1):72, April 2021

    Mathias Perslev, Sune Darkner, Lykke Kempfner, Miki Nikolic, Poul Jørgen Jennum, and Christian Igel. U- Sleep: Resilient high-frequency sleep staging.npj Digital Medicine, 4(1):72, April 2021. ISSN 2398-6352. doi: 10.1038/s41746-021-00440-5. 6, 8, 22, 23

  40. [40]

    Chén, Philipp Koch, Alfred Mertins, and Maarten De V os

    Huy Phan, Kaare Mikkelsen, Oliver Y . Chén, Philipp Koch, Alfred Mertins, and Maarten De V os. Sleep- Transformer: Automatic Sleep Staging With Interpretability and Uncertainty Quantification.IEEE Transactions on Biomedical Engineering, 69(8):2456–2467, August 2022. ISSN 1558-2531. doi: 10.1109/TBME.2022.3147187. 6, 7, 8, 22, 23

  41. [41]

    PaPaGei: Open Foundation Models for Optical Physiological Signals, February 2025

    Arvind Pillai, Dimitris Spathis, Fahim Kawsar, and Mohammad Malekzadeh. PaPaGei: Open Foundation Models for Optical Physiological Signals, February 2025. 3

  42. [42]

    S. F. Quan, B. V . Howard, C. Iber, J. P. Kiley, F. J. Nieto, G. T. O’Connor, D. M. Rapoport, S. Redline, J. Robbins, J. M. Samet, and P. W. Wahl. The Sleep Heart Health Study: Design, rationale, and methods. Sleep, 20(12):1077–1085, December 1997. ISSN 0161-8105. 3

  43. [43]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. 1

  44. [44]

    Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019. 1

  45. [45]

    Redline, P

    S. Redline, P. V . Tishler, T. D. Tosteson, J. Williamson, K. Kump, I. Browner, V . Ferrette, and P. Krejci. The familial aggregation of obstructive sleep apnea.American Journal of Respiratory and Critical Care Medicine, 151(3 Pt 1):682–687, March 1995. ISSN 1073-449X. doi: 10.1164/ajrccm/151.3_Pt_1.682. 3

  46. [46]

    Rosen, Emma K

    Carol L. Rosen, Emma K. Larkin, H. Lester Kirchner, Judith L. Emancipator, Sarah F. Bivins, Susan A. Surovec, Richard J. Martin, and Susan Redline. Prevalence and risk factors for sleep-disordered breathing in 8- to 11-year-old children: Association with race and prematurity.The Journal of Pediatrics, 142(4): 383–389, April 2003. ISSN 0022-3476. doi: 10.1...

  47. [47]

    Schmidt, Claudio L

    Alvise Dei Rossi, Matteo Metaldi, Michal Bechny, Irina Filchenko, Julia van der Meer, Markus H. Schmidt, Claudio L. A. Bassetti, Athina Tzovara, Francesca D. Faraci, and Luigi Fiorillo. SLEEPYLAND: Trust begins with fair evaluation of automatic sleep staging models.npj Digital Medicine, 9(1):55, December

  48. [48]

    doi: 10.1038/s41746-025-02237-2

    ISSN 2398-6352. doi: 10.1038/s41746-025-02237-2. 6

  49. [49]

    Continuous Audio Language Models

    Simon Rouard, Manu Orsini, Axel Roebel, Neil Zeghidour, and Alexandre Défossez. Continuous Audio Language Models. InThe Fourteenth International Conference on Learning Representations, January

  50. [50]

    doi: 10.48550/arXiv.2509.06926. 3

  51. [51]

    Improved Techniques for Training GANs, June 2016

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training GANs, June 2016. 19 13

  52. [52]

    OSF: On pre-training and scaling of sleep foundation models.arXiv preprint arXiv:2603.00190, 2026

    Zitao Shuai, Zongzhe Xu, David Yang, Wei Wang, and Yuzhe Yang. OSF: On Pre-training and Scaling of Sleep Foundation Models. InProceedings of the 43rd International Conference on Machine Learning, February 2026. doi: 10.48550/arXiv.2603.00190. 2, 3, 6, 8, 17, 22, 23

  53. [53]

    Weiss, Niru Maheswaranathan, and Surya Ganguli

    Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics, November 2015. 3

  54. [54]

    Stone, and Osteoporotic Fractures in Men (MrOS) Study Group

    Yeonsu Song, Terri Blackwell, Kristine Yaffe, Sonia Ancoli-Israel, Susan Redline, Katie L. Stone, and Osteoporotic Fractures in Men (MrOS) Study Group. Relationships between sleep stages and changes in cognitive function in older men: The MrOS Sleep Study.Sleep, 38(3):411–421, March 2015. ISSN 1550-9109. doi: 10.5665/sleep.4500. 3

  55. [55]

    Roformer: Enhanced transformer with rotary position embedding,

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with Rotary Position Embedding.Neurocomputing, 568:127063, February 2024. ISSN 0925-2312. doi: 10.1016/j.neucom.2023.127063. 16

  56. [56]

    SEANet: A Multi-modal Speech Enhancement Network, October 2020

    Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, and Dominik Roblek. SEANet: A Multi-modal Speech Enhancement Network, October 2020. 4

  57. [57]

    Brandon Westover, Poul Jennum, Andreas Brink-Kjaer, Emmanuel Mignot, and James Zou

    Rahul Thapa, Magnus Ruud Kjaer, Bryan He, Ian Covert, Hyatt Moore IV , Umaer Hanif, Gauri Ganjoo, M. Brandon Westover, Poul Jennum, Andreas Brink-Kjaer, Emmanuel Mignot, and James Zou. A multimodal sleep foundation model for disease prediction.Nature Medicine, 32(2):752–762, February

  58. [58]

    doi: 10.1038/s41591-025-04133-4

    ISSN 1546-170X. doi: 10.1038/s41591-025-04133-4. 1, 2, 3, 6, 8, 16, 17, 22, 23, 24

  59. [59]

    WaveNet: A Generative Model for Raw Audio, September 2016

    Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A Generative Model for Raw Audio, September 2016. 4

  60. [60]

    Neural Discrete Representation Learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural Discrete Representation Learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. 2

  61. [61]

    Gomez, Lukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need.arXiv:1706.03762 [cs], December 2017. 4, 5

  62. [62]

    Eegpt: Pretrained transformer for universal and reliable representation of eeg signals.Advances in Neural Information Processing Systems, 37:39249–39280, 2024

    Guangyu Wang, Wenchao Liu, Yuhong He, Cong Xu, Lin Ma, and Haifeng Li. Eegpt: Pretrained transformer for universal and reliable representation of eeg signals.Advances in Neural Information Processing Systems, 37:39249–39280, 2024. 3

  63. [63]

    BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals

    Qinfan Xiao, Ziyun Cui, Chi Zhang, Siqi Chen, Wen Wu, Andrew Thwaites, Alexandra Woolgar, Bowen Zhou, and Chao Zhang. BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals. In Advances in Neural Information Processing Systems, volume 38, October 2025. doi: 10.48550/arXiv.2505. 18185. 2, 4, 10, 15, 16, 18

  64. [64]

    Xu, Girish Narayanswamy, Kumar Ayush, Dimitris Spathis, Shun Liao, Shyam A

    Maxwell A. Xu, Girish Narayanswamy, Kumar Ayush, Dimitris Spathis, Shun Liao, Shyam A. Tailor, Ahmed Metwally, A. Ali Heydari, Yuwei Zhang, Jake Garrison, Samy Abdel-Ghaffar, Xuhai Xu, Ken Gu, Jacob Sunshine, Ming-Zher Poh, Yun Liu, Tim Althoff, Shrikanth Narayanan, Pushmeet Kohli, Mark Malhotra, Shwetak Patel, Yuzhe Yang, James M. Rehg, Xin Liu, and Dani...

  65. [65]

    Peppard, F

    Terry Young, Mari Palta, Jerome Dempsey, Paul E. Peppard, F. Javier Nieto, and K. Mae Hla. Burden of sleep apnea: Rationale, design, and major findings of the Wisconsin Sleep Cohort study.WMJ: official publication of the State Medical Society of Wisconsin, 108(5):246–249, August 2009. ISSN 1098-1861. 3

  66. [66]

    Creagh, Catherine Tong, Aidan Acquah, David A

    Hang Yuan, Shing Chan, Andrew P. Creagh, Catherine Tong, Aidan Acquah, David A. Clifton, and Aiden Doherty. Self-supervised learning for human activity recognition using 700,000 person-days of wearable data.npj Digital Medicine, 7(1):1–10, April 2024. ISSN 2398-6352. doi: 10.1038/s41746-024-01062-3. 1, 3

  67. [67]

    Sleep2vec: Unified Cross-Modal Alignment for Heterogeneous Nocturnal Biosignals

    Weixuan Yuan, Zengrui Jin, Yichen Wang, Donglin Xie, Ziyi Ye, Chao Zhang, and Xuesong Chen. Sleep2vec: Unified Cross-Modal Alignment for Heterogeneous Nocturnal Biosignals. https://arxiv.org/abs/2602.13857v1, February 2026. 6, 8, 16, 17, 22, 23

  68. [68]

    Brant-X: A Unified Physiological Signal Alignment Framework

    Daoze Zhang, Zhizhang Yuan, Junru Chen, Kerui Chen, and Yang Yang. Brant-X: A Unified Physiological Signal Alignment Framework. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4155–4166, Barcelona Spain, August 2024. ACM. ISBN 979-8-4007- 0490-1. doi: 10.1145/3637528.3671953. 3

  69. [69]

    The National Sleep Research Resource: Towards a sleep data commons.Journal of the American Medical Informatics Association: JAMIA, 25(10):1351–1358, October

    Guo-Qiang Zhang, Licong Cui, Remo Mueller, Shiqiang Tao, Matthew Kim, Michael Rueschman, Sara Mariani, Daniel Mobley, and Susan Redline. The National Sleep Research Resource: Towards a sleep data commons.Journal of the American Medical Informatics Association: JAMIA, 25(10):1351–1358, October

  70. [70]

    Outcomes of Sleep Disorders in Older Men,

    ISSN 1527-974X. doi: 10.1093/jamia/ocy064. 3 14 A Additional Implementation Details A.1 Preprocessing Referencing and filteringEEG and EOG channels were re-referenced against the contralateral mastoid (C3–M2, C4–M1 for EEG; E1–M2, E2–M1 for EOG). Chin EMG was derived bipolarly from the chin electrode pair. ECG and respiratory effort (ABD, THX) were used d...