arxiv: 2605.22859 · v1 · pith:ANERXOP3new · submitted 2026-05-19 · 📡 eess.SP · cs.AI

Staging by the Book: Automatic Sleep Stage Classification Using Scoring Rules

Pith reviewed 2026-05-25 06:23 UTC · model grok-4.3

classification 📡 eess.SP cs.AI
keywords: sleep staging, AASM rules, rule-based classification, polysomnography, explainable methods, automatic sleep scoring, deterministic staging

The pith

A rule-based system encodes AASM sleep scoring guidelines as executable code to produce classifications and explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a deterministic alternative to machine learning for automatic sleep stage classification. It translates the AASM manual's scoring rules into code that processes polysomnography signals and outputs both a stage and a natural language justification for each 30-second epoch. Tested on 50 recordings against a majority vote from ten scorers, the system reaches 60.5 percent agreement overall. The design prioritizes transparency and rule adherence over matching the highest possible accuracy of black-box models. This makes the method suitable for verifying other automated systems and for clinical oversight.
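The majority-vote reference mentioned above can be illustrated with a minimal sketch. The function names, the stage labels, and the tie-breaking order are assumptions made for illustration; the summary does not say how the paper resolves ties among the ten scorers.

```python
from collections import Counter

# Assumed stage labels and tie-breaking order (hypothetical, not from the paper).
TIE_ORDER = ("W", "N1", "N2", "N3", "R")

def majority_vote(labels, tie_order=TIE_ORDER):
    """Return the most frequent label; ties go to the earliest entry in tie_order."""
    counts = Counter(labels)
    unknown = set(counts) - set(tie_order)
    if unknown:
        raise ValueError(f"unknown stage labels: {sorted(unknown)}")
    top = max(counts.values())
    return next(s for s in tie_order if counts[s] == top)

def consensus_hypnogram(scorings):
    """scorings: one equal-length label sequence per scorer; returns one label per epoch."""
    return [majority_vote(epoch_labels) for epoch_labels in zip(*scorings)]
```

Each position in the returned list is the consensus stage for one 30-second epoch.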

Core claim

The paper introduces a rule-based sleep staging algorithm that directly implements the AASM scoring manual in software, including an explanation trace that converts the decision path into readable text. On a test set of 50 PSG recordings the algorithm agrees with the ten-scorer consensus reference in 60.5 percent of epochs (kappa 0.42). Recall is highest for N2 (83.5 percent) and moderate for R (68.7 percent), while Wake and N1 recall are low. The resulting decisions are fully determined by the encoded rules and come with justifications that mirror clinical reasoning.

What carries the argument

An executable encoding of the AASM sleep staging rules together with an explanation trace that generates epoch-level natural-language justifications.
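To make the idea of an explanation trace concrete, here is a heavily simplified sketch. It is not the authors' encoding: the feature names, the order in which stages are checked, and the reduction of each stage to a single criterion are all assumptions. The thresholds (alpha in more than 50% of the epoch for Wake, slow wave activity in at least 20% for N3) follow the commonly cited AASM definitions.

```python
from dataclasses import dataclass

@dataclass
class EpochFeatures:
    # Hypothetical pre-computed features for one 30-second epoch.
    alpha_fraction: float      # share of the epoch showing alpha rhythm
    slow_wave_fraction: float  # share of the epoch showing slow wave activity
    has_spindle_or_k: bool     # sleep spindle or K-complex detected
    has_rems: bool             # rapid eye movements detected
    low_chin_emg: bool         # low chin EMG tone

def stage_epoch(e):
    """Check stages in a fixed (assumed) order and record why each was kept or ruled out."""
    rules = [
        ("W", e.alpha_fraction > 0.5, "alpha rhythm in more than 50% of the epoch"),
        ("N3", e.slow_wave_fraction >= 0.2, "slow wave activity in at least 20% of the epoch"),
        ("R", e.has_rems and e.low_chin_emg, "rapid eye movements with low chin EMG tone"),
        ("N2", e.has_spindle_or_k, "sleep spindle or K-complex present"),
    ]
    trace = []
    for stage, holds, criterion in rules:
        if holds:
            trace.append(f"{stage} assigned: {criterion}.")
            return stage, trace
        trace.append(f"{stage} ruled out: criterion not met ({criterion}).")
    trace.append("N1 assigned: no other stage criterion was met.")
    return "N1", trace
```

Joining the trace entries yields a readable justification for the epoch, which is the role the explanation trace plays in the paper's design.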

If this is right

  • The method supplies verifiable, rule-following decisions that can audit opaque machine learning models.
  • Natural language explanations allow clinicians to inspect why a particular stage was assigned.
  • Deterministic behavior eliminates variability from training data or model initialization.
  • Lower agreement than deep learning models is accepted in exchange for explicit alignment with clinical guidelines.
  • The gap between agreement on the dataset used during development (77.1%) and across all recordings (60.5%) suggests that implementation choices affect outcomes.
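The auditing use case in the first bullet can be sketched as a comparison between the rule-based stages and those of a learned model. The function and field names are hypothetical, and the sketch assumes both systems score the same epochs in the same order.

```python
def audit(rule_stages, model_stages, traces):
    """List epochs where a model's stage differs from the rule-based stage.

    traces[i] is the rule-based explanation for epoch i, attached so that a
    reviewer can see which rule the model's output appears to contradict.
    """
    if not (len(rule_stages) == len(model_stages) == len(traces)):
        raise ValueError("all inputs must cover the same epochs")
    return [
        {"epoch": i, "rule": r, "model": m, "why": t}
        for i, (r, m, t) in enumerate(zip(rule_stages, model_stages, traces))
        if r != m
    ]
```

A flagged epoch is not necessarily a model error; it only marks a place where the model and the encoded rules diverge and a human should look.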

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Disagreements with human consensus may point to specific ambiguities in the AASM guidelines that need clarification.
  • The rule set could be used to create large volumes of labeled data for training more accurate yet still interpretable models.
  • Similar rule translations might apply to other standardized medical scoring procedures beyond sleep.
  • Integration with signal processing pipelines could allow real-time staging during recordings.

Load-bearing premise

The AASM scoring rules are sufficiently precise and complete to be converted into deterministic code without significant loss of the judgment human experts apply to edge cases.

What would settle it

Expert review of the code's output on a new set of epochs that identifies systematic misapplications of the AASM rules arising from unencoded ambiguities.

Figures

Figures reproduced from arXiv: 2605.22859 by Anna Sigridur Islind, Emil Hardarson, Erna Sif Arnardóttir, Konstantin Popov, María Óskarsdóttir, Sigridur Sigurdardottir.

Figure 1: Schematic overview of the automatic sleep staging process. The ...
Figure 2: One epoch with micro-annotations. The displayed channels include ...
Figure 3: Example of a sequential elimination trace produced by the rule-based ...
Figure 4: Example of a natural-language explanation dialogue produced by the ...
Figure 5: The top panel shows the human consensus hypnogram for the ...
Figure 6: The top panel shows the human consensus hypnogram for one of ...
Figure 7: Distribution of human inter-scorer agreement for epochs where the ...
Abstract

Automated sleep staging is commonly approached as a supervised machine learning problem, with deep learning methods dominating recent research. While machine learning models achieve near-human level agreement with human-scored reference sleep stages, their decisions are typically opaque and not designed to follow clinical scoring rules. We propose a transparent alternative: a deterministic, rule-based sleep staging method that explicitly operationalizes the American Academy of Sleep Medicine's (AASM) scoring logic as executable code, coupled with epoch-level natural-language justifications derived from an explanation trace. We evaluate the approach on 50 polysomnography recordings with a 10-scorer majority-vote consensus as reference. Across all recordings, the method agreed with the majority-vote reference in 60.5% of epochs ($\kappa=0.42$), with substantially higher agreement on a dataset used during development (77.1%, $\kappa=0.61$). Agreement with the reference was highest for sleep stage N2 (recall 83.5%) and moderate for sleep stage R (recall 68.7%), while Wake and N1 recall were low. Despite lower agreement with the reference than contemporary deep learning models, the method provides deterministic decisions and natural language explanations aligned with AASM scoring rules, making it a complementary tool for auditing, debugging, and governing deep learning-based sleep staging.
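The three metrics quoted in the abstract (epoch agreement, Cohen's kappa, and per-stage recall) can be computed from two label sequences as in the sketch below. The function names are illustrative, and nothing here reproduces the paper's evaluation code.

```python
from collections import Counter

def agreement(ref, pred):
    """Fraction of epochs where the method matches the reference."""
    return sum(r == p for r, p in zip(ref, pred)) / len(ref)

def cohen_kappa(ref, pred):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(ref)
    p_o = agreement(ref, pred)
    ref_counts, pred_counts = Counter(ref), Counter(pred)
    p_e = sum(ref_counts[s] * pred_counts[s] for s in ref_counts) / n**2
    if p_e == 1:
        return float("nan")  # undefined when both sequences use one identical label
    return (p_o - p_e) / (1 - p_e)

def recall(ref, pred, stage):
    """Share of reference epochs of `stage` that the method also labels `stage`."""
    total = sum(r == stage for r in ref)
    if total == 0:
        return float("nan")  # stage never occurs in the reference
    return sum(r == stage and p == stage for r, p in zip(ref, pred)) / total
```

Kappa is lower than raw agreement whenever some matches would be expected by chance, which is why 60.5% agreement corresponds to a kappa of only 0.42 here.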

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a deterministic, rule-based sleep staging algorithm that translates AASM scoring rules into executable code, generating epoch-level natural-language explanations from an internal trace. Evaluated on 50 PSG recordings against a 10-scorer majority-vote reference, it achieves 60.5% epoch agreement (κ=0.42) overall and 77.1% (κ=0.61) on a dataset used during development, with highest recall for N2 and low recall for Wake and N1; the work positions this as a transparent complement to opaque deep-learning models for auditing and governance.

Significance. If the rule translations prove faithful, the approach supplies a reproducible baseline with no learned parameters that can serve as an auditing tool for ML sleep-staging systems and as an educational or regulatory reference. The explicit use of an external consensus reference and the generation of human-readable justifications are concrete strengths that address a recognized gap in interpretability. The lower absolute agreement relative to contemporary DL models is expected and does not diminish the potential utility for verification tasks.

major comments (3)
  1. §2 (Rule Implementation): The description of the executable AASM encoding does not specify the concrete numerical cutoffs, tie-breaking procedures, or edge-case resolutions chosen for inherently ambiguous manual criteria (e.g., amplitude thresholds for slow waves, K-complex detection, or contextual stage-transition rules). Without these details or an external validation against multiple scorers on ambiguous epochs, the central claim that the code constitutes a faithful, lossless operationalization cannot be assessed.
  2. Abstract and Evaluation section: The 16.6-point gap between development-set agreement (77.1%) and overall agreement (60.5%) raises the possibility that implementation choices were tuned to the development recordings. This directly undermines the asserted deterministic, non-data-dependent character of the method and must be resolved by documenting a strict separation with no post-hoc adjustments.
  3. Results: No per-epoch or per-recording breakdown is provided that isolates performance on epochs where the 10 human scorers themselves disagree; such an analysis is required to determine whether the reported 60.5% agreement reflects intrinsic limits of the AASM rules or artifacts introduced by the deterministic encoding.
minor comments (2)
  1. The manuscript should include at least one full worked example of an epoch trace with the generated natural-language justification in the main text or a clearly labeled supplementary figure.
  2. [§2] The version of the AASM manual being operationalized (2012 or later) and any explicit deviations from the printed guidelines should be stated in §2.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address the major comments point by point below, proposing revisions where appropriate to strengthen the manuscript.

  1. Referee: §2 (Rule Implementation): The description of the executable AASM encoding does not specify the concrete numerical cutoffs, tie-breaking procedures, or edge-case resolutions chosen for inherently ambiguous manual criteria (e.g., amplitude thresholds for slow waves, K-complex detection, or contextual stage-transition rules). Without these details or an external validation against multiple scorers on ambiguous epochs, the central claim that the code constitutes a faithful, lossless operationalization cannot be assessed.

    Authors: We agree that additional details on the specific numerical thresholds and handling of edge cases are necessary to allow full assessment of the implementation's fidelity. In the revised manuscript, we will include an expanded section or supplementary material that lists all concrete cutoffs (e.g., for delta wave amplitude, K-complex criteria) and tie-breaking rules used in the code. We will also reference the open-source implementation for complete transparency. While we cannot perform new external validation on ambiguous epochs without additional data, the majority-vote reference already reflects inter-scorer variability, and we will add a note on this limitation. revision: yes

  2. Referee: Abstract and Evaluation section: The 16.6-point gap between development-set agreement (77.1%) and overall agreement (60.5%) raises the possibility that implementation choices were tuned to the development recordings. This directly undermines the asserted deterministic, non-data-dependent character of the method and must be resolved by documenting a strict separation with no post-hoc adjustments.

    Authors: The development subset was employed only during the initial coding phase to verify that the rule translations produced reasonable explanations on a small number of recordings; no quantitative metrics were optimized, and no adjustments were made based on the full evaluation results. The method remains fully deterministic with no learned parameters. To address the concern, we will revise the manuscript to clearly document the development recordings used, confirm that no post-hoc changes were applied after the full evaluation, and emphasize that the performance difference arises from the varying difficulty across recordings rather than data-dependent tuning. revision: yes

  3. Referee: Results: No per-epoch or per-recording breakdown is provided that isolates performance on epochs where the 10 human scorers themselves disagree; such an analysis is required to determine whether the reported 60.5% agreement reflects intrinsic limits of the AASM rules or artifacts introduced by the deterministic encoding.

    Authors: We concur that dissecting performance on epochs with high inter-scorer disagreement would help isolate the sources of discrepancy. However, the dataset provides only the majority-vote labels and not the individual scorer annotations per epoch, which precludes this specific analysis. We will add a discussion of this limitation in the revised paper and note that the overall agreement with the consensus serves as a conservative estimate. If individual scorer data were available, such a breakdown could be performed in future extensions. revision: partial
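The breakdown requested in major comment 3 could be sketched as follows, assuming per-scorer labels are available for every epoch. The function names and the choice of five equal-width bins are illustrative assumptions.

```python
from collections import Counter, defaultdict

def human_agreement_ratio(epoch_labels):
    """Fraction of scorers who chose the most common label for one epoch."""
    return max(Counter(epoch_labels).values()) / len(epoch_labels)

def agreement_by_consensus_strength(scorings, reference, method, n_bins=5):
    """Method-vs-reference agreement, grouped by how strongly the scorers agreed.

    scorings: one label sequence per scorer; reference and method: one label per epoch.
    Returns {bin_index: agreement}, where higher bins mean stronger human consensus.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for epoch_labels, ref, pred in zip(zip(*scorings), reference, method):
        ratio = human_agreement_ratio(epoch_labels)
        b = min(int(ratio * n_bins), n_bins - 1)  # a ratio of 1.0 falls in the top bin
        totals[b] += 1
        hits[b] += ref == pred
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

If method agreement rises with the bin index, disagreements are concentrated in epochs that human scorers also find ambiguous.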

Circularity Check

0 steps flagged

No circularity: derivation from external AASM rules is self-contained

The paper's core method is an explicit translation of external AASM scoring guidelines into deterministic code, with no equations, fitted parameters, or self-citations forming the load-bearing chain. The development-set agreement (77.1%) is reported separately from the primary evaluation on the 50-recording majority-vote set (60.5%), without presenting the development result as an independent prediction or validation. Implementation choices for ambiguities are acknowledged as necessary but do not reduce the central claim to a fit or self-definition; the transparency argument rests on the external rule source rather than internal data tuning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that AASM clinical guidelines are sufficiently precise and unambiguous to be fully encoded as deterministic logic without loss of meaning. No free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption: AASM scoring rules can be fully and unambiguously translated into deterministic executable code.
    The paper assumes the clinical guidelines are precise enough for direct coding without loss of nuance.

pith-pipeline@v0.9.0 · 5798 in / 1281 out tokens · 39715 ms · 2026-05-25T06:23:15.225536+00:00 · methodology
