
arxiv: 1906.09936 · v1 · pith:57ZJT6UInew · submitted 2019-06-20 · 📡 eess.SP · cs.LG · stat.ML

AI vs Humans for the diagnosis of sleep apnea

Pith reviewed 2026-05-25 19:30 UTC · model grok-4.3

classification 📡 eess.SP · cs.LG · stat.ML
keywords sleep apnea · obstructive sleep apnea · polysomnography · deep learning · event detection · OSA diagnosis · artificial intelligence · inter-scorer variability

The pith

An adapted deep learning method reaches 81% accuracy diagnosing sleep apnea severity compared to 75% for human experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a deep learning detector can match trained sleep experts at diagnosing obstructive sleep apnea from polysomnography recordings. It adapts the DOSED method to identify breathing events on a set of 52 recordings scored independently by five experts. The automatic system classified severity at 81% accuracy and detected events at an average F1 of 0.57, while experts averaged 75% accuracy and 0.55 F1. A reader would care because expert scoring is slow and shows high variability between scorers, so a reliable automatic alternative could make diagnosis faster and more consistent.

Core claim

We adapted our state-of-the-art deep learning method for sleep event detection, DOSED, to the detection of sleep breathing events in PSG for the diagnosis of OSA. We used a dataset of 52 PSG recordings with apnea-hypopnea event scoring from 5 trained sleep experts. We observed that human sleep experts reached an average accuracy of 75% while the automatic approach reached 81% for sleep apnea severity diagnosis. The F1 score for individual event detection was 0.55 for experts and 0.57 for the automatic approach, on average. These results demonstrate that the automatic approach can perform at a sleep expert level for the diagnosis of OSA.

What carries the argument

The DOSED deep learning model adapted to detect apnea and hypopnea events in polysomnography signals and classify OSA severity.

If this is right

  • Automatic methods can classify OSA severity at least as accurately as human experts on the tested data.
  • Individual apnea-hypopnea event detection reaches F1 scores comparable to those of experts.
  • Deep learning can be applied to reduce the time and variability of manual PSG scoring.
  • The approach supports replacing or assisting experts for routine OSA diagnosis tasks.
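
The event-level F1 comparison above depends on how a detected event is matched to a scored one. A minimal sketch of one common scheme follows, assuming events are (start, end) intervals in seconds and that a detection counts as correct when its intersection-over-union with an unmatched reference event reaches a threshold. The function names, the greedy matching, and the 0.3 threshold are illustrative assumptions, not the paper's stated protocol.

```python
from typing import List, Tuple

Event = Tuple[float, float]  # (start_s, end_s); hypothetical representation


def iou(a: Event, b: Event) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def event_f1(predicted: List[Event], reference: List[Event],
             min_iou: float = 0.3) -> Tuple[float, float, float]:
    """Greedy one-to-one matching; returns (precision, recall, F1).

    min_iou is an assumed threshold, chosen only for illustration.
    """
    unmatched = list(reference)
    true_pos = 0
    for p in sorted(predicted):
        # Pick the still-unmatched reference event that overlaps p the most.
        best = max(unmatched, key=lambda r: iou(p, r), default=None)
        if best is not None and iou(p, best) >= min_iou:
            unmatched.remove(best)
            true_pos += 1
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy intervals, not taken from the paper's data.
predicted = [(10, 30), (100, 120), (200, 210)]
reference = [(12, 32), (105, 125), (300, 320), (400, 420)]
precision, recall, f1 = event_f1(predicted, reference)
```

Greedy matching is simple but not guaranteed to find the best possible assignment; an optimal bipartite matching would be the stricter choice when events overlap densely.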

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Routine use of such detectors could shorten the time between sleep study and diagnosis in busy clinics.
  • Combining the automatic output with selective expert review might raise overall reliability further.
  • Testing the same method on recordings from different equipment or patient groups would check whether the expert-level result holds more widely.

Load-bearing premise

Performance measured on this single collection of 52 recordings scored by five experts is enough to conclude that the automatic method reaches expert level in general.

What would settle it

An independent test on a new set of recordings from multiple centers where the automatic method scores below 75% accuracy or shows statistically worse event detection than the expert average would disprove the claim.

Figures

Figures reproduced from arXiv: 1906.09936 by Albert Bou Hernandez, Emmanuel H. During, Pierrick J. Arnal, Valentin Thorey.

Figure 1: DOSED during prediction. We consider an airflow …
Figure 2: Top: Precision and Recall for each scorer and the …
Original abstract

Polysomnography (PSG) is the gold standard for diagnosing obstructive sleep apnea (OSA). It allows monitoring of breathing events throughout the night. The detection of these events is usually done by trained sleep experts. However, this task is tedious, highly time-consuming and subject to important inter-scorer variability. In this study, we adapted our state-of-the-art deep learning method for sleep event detection, DOSED, to the detection of sleep breathing events in PSG for the diagnosis of OSA. We used a dataset of 52 PSG recordings with apnea-hypopnea event scoring from 5 trained sleep experts. We assessed the performance of the automatic approach and compared it to the inter-scorer performance for both the diagnosis of OSA severity and, at the microscale, for the detection of single breathing events. We observed that human sleep experts reached an average accuracy of 75% while the automatic approach reached 81% for sleep apnea severity diagnosis. The F1 score for individual event detection was 0.55 for experts and 0.57 for the automatic approach, on average. These results demonstrate that the automatic approach can perform at a sleep expert level for the diagnosis of OSA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper adapts the DOSED deep learning model for detecting sleep breathing events in polysomnography (PSG) recordings and compares its performance to that of five human sleep experts on a dataset of 52 recordings. It reports that the automatic method achieves 81% accuracy for OSA severity diagnosis (vs. 75% average for experts) and 0.57 average F1 for individual event detection (vs. 0.55 for experts), concluding that the automatic approach performs at expert level.

Significance. If the evaluation protocol is sound and the small observed differences are robust, the work would provide evidence that deep learning can match inter-expert agreement levels in OSA diagnosis, which is clinically relevant given known scorer variability. The use of multiple independent expert scorings as the reference standard is a positive design choice that avoids circularity.

major comments (2)
  1. [Abstract] Abstract: The reported performance numbers for the automatic approach (81% accuracy, 0.57 F1) are given without any description of the training procedure, validation splits, hyperparameter selection, or whether the 52 recordings were used only for testing, in cross-validation, or with possible overlap from prior DOSED training. This information is load-bearing for the central claim that the method reaches expert level.
  2. [Abstract] Abstract/Results: No statistical significance tests, confidence intervals, or variability estimates are provided for the differences between AI and expert performance (6 percentage points in accuracy, 0.02 in F1) on the small set of 52 recordings scored by 5 experts. Given known inter-scorer variability, it is unclear whether the modest edge is reliable or could arise from chance or scorer differences.
minor comments (1)
  1. [Abstract] The abstract states aggregate accuracy and F1 but does not clarify how severity diagnosis is derived from event detections (e.g., AHI thresholds or per-recording aggregation).
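
To make the minor comment concrete, severity is conventionally derived from the apnea-hypopnea index (AHI), the number of scored events per hour of sleep. The sketch below uses the commonly cited adult cutoffs (5, 15, and 30 events per hour). Whether the paper uses exactly these cutoffs or this aggregation is not stated on this page, so treat the thresholds and all function names as assumptions.

```python
from typing import List


def ahi(n_events: int, total_sleep_time_h: float) -> float:
    """Apnea-hypopnea index: scored events per hour of sleep."""
    if total_sleep_time_h <= 0:
        raise ValueError("total sleep time must be positive")
    return n_events / total_sleep_time_h


def severity(ahi_value: float) -> str:
    """Map an AHI to a severity class using commonly cited adult cutoffs (assumed)."""
    if ahi_value < 5:
        return "normal"
    if ahi_value < 15:
        return "mild"
    if ahi_value < 30:
        return "moderate"
    return "severe"


def severity_accuracy(predicted: List[str], reference: List[str]) -> float:
    """Fraction of recordings whose predicted class equals the reference class."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


# Toy example: 84 events over 7 hours of sleep gives an AHI of 12.
example_class = severity(ahi(84, 7.0))
```

Because the classes are bins on a continuous index, a small difference in event counts near a cutoff can flip the class, which is one reason per-recording accuracy and event-level F1 can diverge.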

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important points for improving the clarity and statistical rigor of the manuscript. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The reported performance numbers for the automatic approach (81% accuracy, 0.57 F1) are given without any description of the training procedure, validation splits, hyperparameter selection, or whether the 52 recordings were used only for testing, in cross-validation, or with possible overlap from prior DOSED training. This information is load-bearing for the central claim that the method reaches expert level.

    Authors: We agree that the abstract should explicitly summarize the evaluation protocol to support the central claim. The 52 recordings constitute a held-out test set with no overlap from the data used to develop or tune DOSED; full details of the training procedure, cross-validation, and hyperparameter selection appear in the Methods section. We will revise the abstract to include a concise statement of this protocol.
    Revision: yes

  2. Referee: [Abstract] Abstract/Results: No statistical significance tests, confidence intervals, or variability estimates are provided for the differences between AI and expert performance (6 percentage points in accuracy, 0.02 in F1) on the small set of 52 recordings scored by 5 experts. Given known inter-scorer variability, it is unclear whether the modest edge is reliable or could arise from chance or scorer differences.

    Authors: We acknowledge the absence of statistical tests or confidence intervals in the current version. We will add bootstrap confidence intervals around the performance metrics and a paired statistical comparison (e.g., permutation test) between the automatic method and the expert panel to assess whether the observed differences are robust given the sample size and inter-scorer variability.
    Revision: yes
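
The two analyses named in the second response can be sketched as follows, assuming each recording yields a 0/1 correctness flag for the automatic method and for an expert. The flags below are synthetic placeholders for illustration only and are not the paper's per-recording results; the function names and resampling counts are likewise assumptions.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(correct: List[int], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy, resampling recordings."""
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(mean(rng.choices(correct, k=n)) for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def paired_permutation_p(a: List[int], b: List[int],
                         n_perm: int = 5000, seed: int = 0) -> float:
    """Two-sided sign-flip permutation test on paired per-recording differences."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, the sign of each paired difference is exchangeable.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Synthetic flags (1 = severity class correct), NOT data from the paper.
model_correct = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0]
expert_correct = [1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0]

ci_low, ci_high = bootstrap_ci(model_correct)
p_value = paired_permutation_p(model_correct, expert_correct)
```

Resampling at the recording level keeps all events from one night together, which matters because events within a recording are not independent.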

Circularity Check

0 steps flagged

No significant circularity; empirical comparison uses independent expert scorings as reference standard

full rationale

The paper's central claim rests on a direct empirical comparison of the adapted DOSED model against five human experts' scorings on the same 52 PSG recordings, with performance metrics (accuracy for severity diagnosis, F1 for event detection) computed against those independent human labels. No equations, fitted parameters, or self-citations are shown to define the target metrics in terms of the model's own outputs or prior results; the evaluation protocol treats expert annotations as an external reference. Self-citation to the original DOSED work is present but not load-bearing for the reported performance numbers or the 'expert level' conclusion, which derives from the new dataset comparison rather than reducing to prior definitions by construction. This matches the default expectation of no circularity for an empirical methods paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper adapts a prior deep learning model to a new clinical task using standard supervised learning assumptions on a small labeled dataset; no explicit free parameters beyond typical neural network weights are described.

free parameters (1)
  • DOSED model parameters
    Neural network weights and hyperparameters fitted during adaptation to the 52 PSG recordings.
axioms (1)
  • domain assumption: The scorings by the five experts constitute a reliable benchmark for human-level performance
    Used directly as the comparison baseline for both severity diagnosis and event detection.
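
One way to turn several expert scorings into a human baseline is to compare each expert with a consensus built from the remaining experts. The sketch below illustrates that idea on per-recording severity labels with a majority vote. It is an assumption about how such a baseline could be computed, not a description of the paper's protocol, and the scorer names and labels are invented toy values.

```python
from collections import Counter
from typing import Dict, List


def majority_label(labels: List[str]) -> str:
    """Most frequent label; ties go to the label that appears first in the input."""
    counts = Counter(labels)
    top = max(counts.values())
    for label in labels:
        if counts[label] == top:
            return label
    raise ValueError("labels must be non-empty")


def leave_one_out_accuracy(scorings: Dict[str, List[str]]) -> Dict[str, float]:
    """Accuracy of each scorer against the majority vote of all other scorers."""
    result = {}
    for held_out, own in scorings.items():
        others = [labels for name, labels in scorings.items() if name != held_out]
        consensus = [majority_label(list(column)) for column in zip(*others)]
        result[held_out] = sum(a == b for a, b in zip(own, consensus)) / len(own)
    return result


# Toy labels for three hypothetical scorers over four recordings.
scorings = {
    "scorer_a": ["mild", "severe", "normal", "moderate"],
    "scorer_b": ["mild", "severe", "mild", "moderate"],
    "scorer_c": ["mild", "moderate", "normal", "moderate"],
}
per_scorer = leave_one_out_accuracy(scorings)
average_expert_accuracy = sum(per_scorer.values()) / len(per_scorer)
```

With an even number of remaining scorers, ties are possible, so the tie-breaking rule is itself a modeling choice that can shift the baseline.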



