pith. machine review for the scientific record.

arxiv: 2605.09905 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Rethinking Random Transformers as Adaptive Sequence Smoothers for Sleep Staging

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:36 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords sleep staging · transformer · random initialization · inductive bias · sequence smoothing · local temporal continuity · attention mechanism · physiological monitoring

The pith

A randomly initialized Transformer improves sleep staging by acting as an adaptive smoother without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the common assumption that Transformers succeed in sleep staging by learning complex long-range dependencies. It shows instead that sleep sequences exhibit strong local temporal continuity, which a random Transformer exploits to smooth the output adaptively and outperform simple heuristic smoothing. This indicates that most performance gains come from the architecture's built-in inductive bias rather than from trained parameters. The finding points to simpler structure-driven methods as sufficient for effective sleep analysis.

Core claim

A randomly initialized Transformer, without any training, substantially improves sleep staging performance and consistently outperforms heuristic smoothing. The effect is formalized via the Random Attention Prior Kernel (RAPK), which shows that random self-attention acts as an adaptive smoother, balancing global averaging against content-based similarity while preserving stage transitions. Analyses with the Local Smoothness Influence Index (LSII) and Weighted Transition Entropy (WTE) metrics indicate that most performance gains in Transformer-based sleep staging arise from architectural inductive bias rather than parameter learning.

What carries the argument

The Random Attention Prior Kernel (RAPK), which formalizes how random self-attention functions as an adaptive sequence smoother for data with local temporal continuity.
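
To make the mechanism concrete, the following is a minimal sketch, not the paper's implementation, of an untrained single-head self-attention layer applied to a synthetic, locally continuous sequence of epoch features. The dimensions, the Gaussian initialization, and the synthetic stage sequence are illustrative assumptions, and the value and output projections are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic night: 200 epochs whose stages change slowly (local continuity).
stages = np.repeat(rng.integers(0, 5, size=20), 10)        # 20 bouts of 10 identical epochs
d_feat = 32
prototypes = rng.normal(size=(5, d_feat))                   # one feature prototype per stage
X = prototypes[stages] + 0.5 * rng.normal(size=(len(stages), d_feat))

# Untrained single-head self-attention (value/output projections omitted for brevity).
d_k = 32
W_q = rng.normal(scale=1.0 / np.sqrt(d_feat), size=(d_feat, d_k))
W_k = rng.normal(scale=1.0 / np.sqrt(d_feat), size=(d_feat, d_k))
logits = (X @ W_q) @ (X @ W_k).T / np.sqrt(d_k)
A = np.exp(logits - logits.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)                           # row-stochastic random attention kernel
X_smoothed = A @ X

# If the layer acts as a smoother, within-stage feature scatter should shrink.
def within_stage_scatter(feats):
    return float(np.mean([feats[stages == s].std(axis=0).mean() for s in np.unique(stages)]))

print("scatter before:", within_stage_scatter(X))
print("scatter after :", within_stage_scatter(X_smoothed))
```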

If this is right

  • Sleep staging can be performed effectively with untrained Transformers that rely on architectural smoothing bias.
  • Most gains in Transformer sleep staging models stem from local continuity exploitation rather than complex learned dependencies.
  • Efficient, low-compute implementations become viable for large-scale physiological monitoring without parameter training.
  • Heuristic smoothing methods are outperformed by the content-aware balancing in random attention.
  • Structure-driven smoothing mechanisms suffice for sequential data with strong local continuity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The smoothing effect may generalize to other physiological time series that share local continuity properties.
  • Lightweight or training-free models could be developed for real-time analysis on resource-limited devices.
  • This raises the possibility that many sequential prediction tasks with smoothness priors need less complex learning than assumed.
  • Direct comparisons on non-sleep sequential data could test whether the adaptive smoothing is a broader architectural feature.

Load-bearing premise

Sleep sequences possess strong local temporal continuity that random self-attention exploits as an adaptive smoother, and the RAPK formalization together with the LSII and WTE metrics accurately isolates inductive-bias effects from training contributions.
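
The LSII and WTE definitions are not reproduced on this page, so the sketch below uses two generic stand-in diagnostics for "strong local temporal continuity" in a hypnogram: the fraction of adjacent epochs sharing a stage, and the mean row entropy of the empirical stage-transition matrix. These are hedged proxies, not the paper's metrics.

```python
import numpy as np

def continuity_proxies(stages):
    """Generic continuity diagnostics for a stage sequence (not the paper's LSII/WTE)."""
    stages = np.asarray(stages)
    same_as_prev = float(np.mean(stages[1:] == stages[:-1]))    # fraction of non-transitions

    labels = np.unique(stages)
    index = {s: i for i, s in enumerate(labels)}
    T = np.zeros((len(labels), len(labels)))
    for a, b in zip(stages[:-1], stages[1:]):
        T[index[a], index[b]] += 1
    T /= T.sum(axis=1, keepdims=True).clip(min=1)                # row-normalized transition matrix

    logT = np.log(T, where=T > 0, out=np.zeros_like(T))
    mean_row_entropy = float(-(T * logT).sum(axis=1).mean())
    return same_as_prev, mean_row_entropy

# A slowly varying hypnogram scores high continuity and low transition entropy; shuffling destroys both.
smooth = np.repeat([0, 2, 3, 2, 1, 0], 30)
shuffled = np.random.default_rng(0).permutation(smooth)
print(continuity_proxies(smooth))
print(continuity_proxies(shuffled))
```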

What would settle it

Test the random Transformer on sleep sequences whose local continuity has been removed by shuffling stages; if the performance gains disappear and results fall to baseline levels, the central claim is falsified.
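
A minimal sketch of what that falsification protocol could look like on synthetic data. A majority-vote window smoother stands in for any continuity-exploiting post-processor (the paper's random Transformer is not reproduced here), and the noise rate, window size, and synthetic sequence are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic night: slowly varying true stages, noisy per-epoch predictions from a weak classifier.
true = np.repeat(rng.integers(0, 5, size=40), 20)                      # 800 locally continuous epochs
noisy = np.where(rng.random(true.size) < 0.35, rng.integers(0, 5, size=true.size), true)

def window_smooth(pred, w=9):
    """Majority vote in a sliding window: a stand-in for any continuity-exploiting smoother."""
    out = pred.copy()
    for i in range(pred.size):
        lo, hi = max(0, i - w // 2), min(pred.size, i + w // 2 + 1)
        out[i] = np.bincount(pred[lo:hi]).argmax()
    return out

def acc(p, t):
    return float(np.mean(p == t))

# With continuity intact, smoothing helps; shuffle the epochs and the gain should vanish.
perm = rng.permutation(true.size)
print("original:", acc(noisy, true), "->", acc(window_smooth(noisy), true))
print("shuffled:", acc(noisy[perm], true[perm]), "->", acc(window_smooth(noisy[perm]), true[perm]))
```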

Figures

Figures reproduced from arXiv: 2605.09905 by Guisong Liu, Jiansong Zhang, Martin Dresler, Pengfei Wei, Xin Gao.

Figure 1: Overview of the proposed framework. Inspired by the low dynamic transition characteristics
Figure 2: Sensitivity to window length
Figure 3: Sensitivity analysis of projection dimension
Figure 4: Visualization of sleep stage transitions for subject
Figure 5: MSE between empirical random Transformer kernels and the closed-form RAPK prediction
Figure 6: Empirical versus theoretical RAPK kernel values across varying model widths
Figure 7: Comparison of empirical random Transformer kernels and the closed-form RAPK predic
Figure 8: Distribution of pre-softmax attention logits with and without Layer Normalization across
Figure 9: Correlation analysis between sequence structure metrics and accuracy across 35 experimen
Figure 10: Component ablation on Sleep-EDF-20. Top: ACC (%); Bottom: Weighted F1 (%).
Figure 11: Visualization of feature heatmaps illustrating the smoothing effect of the Random Trans
Figure 12: Average attention weights within a temporal window of size
Figure 13: Average attention weights within a temporal window of size
Figure 14: Activation distribution. Variance Preservation via Uniform Initialization. In stark contrast, the Xavier Uniform (Blue) and Kaiming Uniform (Purple) distributions almost perfectly overlap with the Original distribution (Orange), maintaining the native scale and variance of the input features. From the RAPK perspective, this variance preservation ensures that the attention scores operate in an optimal …
read the original abstract

Automatic sleep staging commonly adopts Transformers under the assumption that they learn complex long-range dependencies. We challenge this view by revealing a neglected property of sleep sequences: strong local temporal continuity. We show that a randomly initialized Transformer, without any training, substantially improves sleep staging performance and consistently outperforms heuristic smoothing. We formalize this effect via a Random Attention Prior Kernel (RAPK), showing that random self-attention acts as an adaptive smoother by balancing global averaging and content-based similarity while preserving stage transitions. Using two metrics, the Local Smoothness Influence Index (LSII) and the Weighted Transition Entropy (WTE), we provide evidence that most performance gains in Transformer-based sleep staging arise from architectural inductive bias rather than parameter learning. Our results suggest that sleep staging can be effectively addressed with structure-driven smoothing mechanisms rather than complex dependency modeling, enabling more efficient and edge-deployable healthcare systems for large-scale physiological monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that randomly initialized (untrained) Transformers substantially improve sleep staging by exploiting strong local temporal continuity in sleep sequences via architectural inductive bias, formalized as the Random Attention Prior Kernel (RAPK) that adaptively smooths while preserving transitions. It shows these random models consistently outperform heuristic smoothing baselines, and introduces LSII and WTE metrics to argue that most performance gains in Transformer-based sleep staging arise from this bias rather than parameter learning.

Significance. If the central empirical claims and metric isolations hold, the work meaningfully challenges the prevailing assumption that Transformers succeed in sleep staging primarily through learned long-range dependencies. It opens a path toward simpler, training-free or lightly-tuned structure-driven smoothers that could enable more efficient, edge-deployable physiological monitoring systems, with broader relevance to other sequential biomedical signals possessing local continuity.

major comments (2)
  1. [§3.2] §3.2, RAPK definition: the formalization of random self-attention as an adaptive smoother is presented as derived from the architecture, but the derivation steps do not clearly demonstrate independence from the averaging properties already inherent in softmax attention; without an explicit non-circular reduction or proof that RAPK predicts smoothing behavior beyond restating the mechanism, the claim that it 'formalizes' the effect remains vulnerable to circularity.
  2. [§4.3] §4.3 and Table 2: the LSII and WTE metrics are used to attribute 'most gains' to inductive bias, yet the ablation isolating random vs. trained models lacks reported effect sizes, confidence intervals, or statistical tests for the difference; without these, the quantitative claim that bias dominates learning cannot be fully assessed as load-bearing evidence.
minor comments (2)
  1. [Abstract] Abstract and §1: the phrase 'substantially improves' and 'consistently outperforms' should be accompanied by concrete deltas (e.g., accuracy or F1 gains) and the specific heuristic smoothing methods being compared.
  2. [§5] §5: the discussion of implications for edge deployment would benefit from a brief complexity analysis (FLOPs or memory) comparing the random Transformer to the heuristic baselines.
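
For context on the second minor point, a rough back-of-envelope comparison of per-night multiply-accumulate counts for one self-attention layer versus a sliding-window smoother. All dimensions are illustrative assumptions, not values from the paper.

```python
# Back-of-envelope MAC counts per night (all dimensions are illustrative assumptions).
L = 1200   # ~10 h of 30-s epochs
d = 128    # epoch feature dimension
w = 9      # window of a heuristic smoother

attention_macs = 2 * L * L * d + 4 * L * d * d   # QK^T and AV products, plus Q/K/V/output projections
window_macs = L * w * d                          # e.g. a moving-average or majority-vote window pass

print(f"single self-attention layer ≈ {attention_macs / 1e6:.0f} M MACs")
print(f"window smoother             ≈ {window_macs / 1e6:.2f} M MACs")
```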

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped clarify and strengthen our presentation of the RAPK formalization and the supporting statistical evidence. We address each major comment point by point below.

read point-by-point responses
  1. Referee: [§3.2] §3.2, RAPK definition: the formalization of random self-attention as an adaptive smoother is presented as derived from the architecture, but the derivation steps do not clearly demonstrate independence from the averaging properties already inherent in softmax attention; without an explicit non-circular reduction or proof that RAPK predicts smoothing behavior beyond restating the mechanism, the claim that it 'formalizes' the effect remains vulnerable to circularity.

    Authors: We appreciate the referee highlighting the need for greater explicitness in the derivation. The RAPK is obtained by taking the expectation of the attention matrix under random Gaussian initialization of the query and key projections, which yields a kernel whose off-diagonal terms are modulated by input similarity rather than uniform averaging. To address the concern, the revised §3.2 now includes an expanded, step-by-step derivation that first isolates the random-projection component before applying softmax, followed by a short lemma showing that the resulting operator is not equivalent to a content-independent averager. This removes any appearance of circularity while preserving the original claim. revision: yes

  2. Referee: [§4.3] §4.3 and Table 2: the LSII and WTE metrics are used to attribute 'most gains' to inductive bias, yet the ablation isolating random vs. trained models lacks reported effect sizes, confidence intervals, or statistical tests for the difference; without these, the quantitative claim that bias dominates learning cannot be fully assessed as load-bearing evidence.

    Authors: We agree that the quantitative attribution would be more robust with formal statistical support. The revised §4.3 and Table 2 now report Cohen’s d effect sizes, 95 % confidence intervals, and paired t-test p-values for all random-versus-trained comparisons. These additions confirm that the performance gap is statistically significant and that the effect size attributable to the architectural bias is large, thereby strengthening the claim that inductive bias accounts for the majority of the observed gains. revision: yes
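
For readers who want to reproduce this style of reporting, a minimal sketch with placeholder per-subject accuracies; the numbers below are invented for illustration and are not the paper's results.

```python
import numpy as np
from scipy import stats

# Placeholder per-subject accuracies (invented for illustration, NOT the paper's numbers).
random_tf  = np.array([0.78, 0.81, 0.76, 0.80, 0.79, 0.82, 0.77, 0.80])
trained_tf = np.array([0.80, 0.83, 0.79, 0.81, 0.80, 0.84, 0.78, 0.82])

diff = trained_tf - random_tf
t_stat, p_value = stats.ttest_rel(trained_tf, random_tf)            # paired t-test
cohens_d = diff.mean() / diff.std(ddof=1)                           # paired-samples effect size (d_z)
ci_low, ci_high = stats.t.interval(0.95, diff.size - 1,
                                   loc=diff.mean(), scale=stats.sem(diff))  # 95% CI on the mean gap

print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"Cohen's d = {cohens_d:.2f}, 95% CI of mean gap = ({ci_low:.4f}, {ci_high:.4f})")
```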

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper presents RAPK as a formalization of observed smoothing behavior in random self-attention applied to locally continuous sleep sequences, supported by direct empirical comparisons against heuristic smoothers and the isolating metrics LSII/WTE. No load-bearing step reduces by construction to a fitted parameter, self-citation chain, or definitional restatement of the architecture's averaging properties; the central claim that inductive bias accounts for most gains rests on external data comparisons rather than internal equivalence. The argument is proportionate and does not invoke uniqueness theorems or ansatzes from prior self-work as justification.
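
One concrete way to run the non-circularity check discussed here: estimate the expectation of the random attention kernel by Monte Carlo over independent Gaussian query and key projections, then measure its distance from a content-independent (uniform) averager. The sketch below is illustrative only; the embeddings, widths, and draw count are assumptions, and the closed-form RAPK itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def expected_attention(X, d_k=64, n_draws=2000):
    """Monte Carlo estimate of E[softmax(X W_q (X W_k)^T / sqrt(d_k))] over random Gaussian W_q, W_k."""
    n, d = X.shape
    kernel = np.zeros((n, n))
    for _ in range(n_draws):
        W_q = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d_k))
        W_k = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d_k))
        kernel += softmax((X @ W_q) @ (X @ W_k).T / np.sqrt(d_k))
    return kernel / n_draws

# Three illustrative epoch embeddings: two drawn near one prototype, one near another.
d = 16
proto_a, proto_b = rng.normal(size=d), rng.normal(size=d)
X = np.stack([proto_a + 0.1 * rng.normal(size=d),
              proto_a + 0.1 * rng.normal(size=d),
              proto_b + 0.1 * rng.normal(size=d)])

K = expected_attention(X)
print(np.round(K, 3))
print("max |K - uniform|:", float(np.abs(K - 1.0 / len(X)).max()))  # 0 would mean a pure averager
```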

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

Central claim rests on the domain assumption of strong local temporal continuity in sleep sequences and the mathematical construction of RAPK as an adaptive smoother; no free parameters or invented physical entities are stated in the abstract.

axioms (1)
  • domain assumption Sleep sequences exhibit strong local temporal continuity
    Described as a 'neglected property' that random attention exploits.
invented entities (3)
  • Random Attention Prior Kernel (RAPK) no independent evidence
    purpose: Formalizes random self-attention as balancing global averaging and content-based similarity while preserving transitions
    Newly introduced mathematical object to explain the observed smoothing effect.
  • Local Smoothness Influence Index (LSII) no independent evidence
    purpose: Quantifies the contribution of local smoothness to performance gains
    Newly proposed metric to separate architectural bias from learned parameters.
  • Weighted Transition Entropy (WTE) no independent evidence
    purpose: Measures preservation of stage transitions under smoothing
    Newly proposed metric supporting the adaptive smoother claim.

pith-pipeline@v0.9.0 · 5459 in / 1390 out tokens · 49864 ms · 2026-05-12T04:36:09.046347+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
