AI vs Humans for the diagnosis of sleep apnea

Albert Bou Hernandez; Emmanuel H. During; Pierrick J. Arnal; Valentin Thorey

arXiv: 1906.09936 · v1 · pith:57ZJT6UInew · submitted 2019-06-20 · eess.SP · cs.LG · stat.ML

Pith reviewed 2026-05-25 19:30 UTC · model grok-4.3

classification eess.SP · cs.LG · stat.ML

keywords sleep apnea · obstructive sleep apnea · polysomnography · deep learning · event detection · OSA diagnosis · artificial intelligence · inter-scorer variability

The pith

An adapted deep learning method reaches 81% accuracy diagnosing sleep apnea severity compared to 75% for human experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a deep learning detector can match trained sleep experts at diagnosing obstructive sleep apnea from polysomnography recordings. It adapts the DOSED method to identify breathing events on a set of 52 recordings scored independently by five experts. The automatic system classified severity at 81% accuracy and detected events at an average F1 of 0.57, while experts averaged 75% accuracy and 0.55 F1. A reader would care because expert scoring is slow and shows high variability between scorers, so a reliable automatic alternative could make diagnosis faster and more consistent.

Core claim

We adapted our state-of-the-art deep learning method for sleep event detection, DOSED, to the detection of sleep breathing events in PSG for the diagnosis of OSA. We used a dataset of 52 PSG recordings with apnea-hypopnea event scoring from 5 trained sleep experts. We observed that human sleep experts reached an average accuracy of 75% while the automatic approach reached 81% for sleep apnea severity diagnosis. The F1 score for individual event detection was 0.55 for experts and 0.57 for the automatic approach, on average. These results demonstrate that the automatic approach can perform at a sleep expert level for the diagnosis of OSA.

What carries the argument

The DOSED deep learning model adapted to detect apnea and hypopnea events in polysomnography signals and classify OSA severity.

If this is right

Automatic methods can classify OSA severity at least as accurately as human experts on the tested data.
Individual apnea-hypopnea event detection reaches F1 scores comparable to those of experts.
Deep learning can be applied to reduce the time and variability of manual PSG scoring.
The approach supports replacing or assisting experts for routine OSA diagnosis tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Routine use of such detectors could shorten the time between sleep study and diagnosis in busy clinics.
Combining the automatic output with selective expert review might raise overall reliability further.
Testing the same method on recordings from different equipment or patient groups would check whether the expert-level result holds more widely.

Load-bearing premise

Performance measured on this single collection of 52 recordings scored by five experts is enough to conclude that the automatic method reaches expert level in general.

What would settle it

An independent test on a new set of recordings from multiple centers where the automatic method scores below 75% accuracy or shows statistically worse event detection than the expert average would disprove the claim.

Figures

Figures reproduced from arXiv: 1906.09936 by Albert Bou Hernandez, Emmanuel H. During, Pierrick J. Arnal, Valentin Thorey.

**Figure 2.** Top: Precision and Recall for each scorer and the …

Original abstract

Polysomnography (PSG) is the gold standard for diagnosing obstructive sleep apnea (OSA). It allows monitoring of breathing events throughout the night. The detection of these events is usually done by trained sleep experts. However, this task is tedious, highly time-consuming and subject to important inter-scorer variability. In this study, we adapted our state-of-the-art deep learning method for sleep event detection, DOSED, to the detection of sleep breathing events in PSG for the diagnosis of OSA. We used a dataset of 52 PSG recordings with apnea-hypopnea event scoring from 5 trained sleep experts. We assessed the performance of the automatic approach and compared it to the inter-scorer performance for both the diagnosis of OSA severity and, at the microscale, for the detection of single breathing events. We observed that human sleep experts reached an average accuracy of 75% while the automatic approach reached 81% for sleep apnea severity diagnosis. The F1 score for individual event detection was 0.55 for experts and 0.57 for the automatic approach, on average. These results demonstrate that the automatic approach can perform at a sleep expert level for the diagnosis of OSA.
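
The abstract reports an F1 score for single-event detection without saying how a detected event is counted as correct. The following is a minimal sketch of one plausible scheme, not the paper's procedure: a predicted event counts as a hit when its overlap (intersection over union) with a still-unmatched reference event reaches a threshold. The function names, the greedy matching, and the `min_iou=0.3` default are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def event_f1(predicted, reference, min_iou=0.3):
    """Precision, recall and F1 with greedy one-to-one event matching.

    min_iou is a hypothetical threshold; the abstract does not give one.
    """
    unmatched = list(reference)
    true_positives = 0
    for p in predicted:
        # Pick the unmatched reference event that overlaps p the most.
        best = max(unmatched, key=lambda r: iou(p, r), default=None)
        if best is not None and iou(p, best) >= min_iou:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example events (start, end) in seconds, for illustration only.
reference_events = [(10, 30), (100, 120), (200, 215)]
predicted_events = [(12, 28), (105, 130), (300, 320)]
precision, recall, f1 = event_f1(predicted_events, reference_events)
```

In this invented example two of the three predicted events overlap a reference event enough to match, and one reference event is missed. Scoring each expert against the others with the same function would give an inter-scorer F1 on the same footing as the model's.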

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DOSED edges experts by a few points on 52 recordings but the evaluation protocol is not described at all

The one thing to know is that this paper reports the authors' DOSED model reaching 81% accuracy on OSA severity and 0.57 F1 on event detection, while five experts averaged 75% and 0.55 on the same 52 PSG recordings. Those exact numbers on this dataset are new empirical results. The work does a reasonable job of using inter-expert agreement as the benchmark instead of pretending there is a perfect gold standard, which is the right framing for this task. It also applies an existing architecture without claiming to invent a new one.

The soft spots are straightforward. The abstract gives zero information on whether the 52 recordings were used only for testing, how the model was trained or fine-tuned here, what splits or cross-validation were applied, or any statistical test for the small observed gaps. With n=52 those gaps could easily be noise, and the claim that the automatic approach reaches expert level rests on this thin base. The stress-test note is accurate on this point.

The paper is for researchers already working on automated PSG scoring who want to see one more head-to-head comparison. A reader in that niche might extract the numbers for context, but the missing protocol limits how much weight the result can carry. I would bring it to a reading group as a maybe to talk through the evaluation gaps. I would not cite it in my own work. It deserves peer review because the core comparison is sensible and the data is real, even though the current version needs the methods filled in before the numbers can be taken seriously.
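
To put rough numbers on the remark that with n=52 "those gaps could easily be noise": a back-of-the-envelope sketch, assuming (the abstract does not confirm this) that the 81% is a simple proportion of correctly classified recordings out of 52, that recordings are independent, and that the normal approximation to the binomial is acceptable.

```latex
\[
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
  = \sqrt{\frac{0.81 \times 0.19}{52}} \approx 0.054,
\qquad
\hat{p} \pm 1.96\,\mathrm{SE}(\hat{p}) \approx 0.81 \pm 0.11 .
\]
```

Under those assumptions the approximate 95% interval runs from about 70% to 92%, which contains the experts' 75% average.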

Referee Report

2 major / 1 minor

Summary. The paper adapts the DOSED deep learning model for detecting sleep breathing events in polysomnography (PSG) recordings and compares its performance to that of five human sleep experts on a dataset of 52 recordings. It reports that the automatic method achieves 81% accuracy for OSA severity diagnosis (vs. 75% average for experts) and 0.57 average F1 for individual event detection (vs. 0.55 for experts), concluding that the automatic approach performs at expert level.

Significance. If the evaluation protocol is sound and the small observed differences are robust, the work would provide evidence that deep learning can match inter-expert agreement levels in OSA diagnosis, which is clinically relevant given known scorer variability. The use of multiple independent expert scorings as the reference standard is a positive design choice that avoids circularity.

major comments (2)

[Abstract] Abstract: The reported performance numbers for the automatic approach (81% accuracy, 0.57 F1) are given without any description of the training procedure, validation splits, hyperparameter selection, or whether the 52 recordings were used only for testing, in cross-validation, or with possible overlap from prior DOSED training. This information is load-bearing for the central claim that the method reaches expert level.
[Abstract] Abstract/Results: No statistical significance tests, confidence intervals, or variability estimates are provided for the differences between AI and expert performance (6 percentage points in accuracy, 0.02 in F1) on the small set of 52 recordings scored by 5 experts. Given known inter-scorer variability, it is unclear whether the modest edge is reliable or could arise from chance or scorer differences.

minor comments (1)

[Abstract] The abstract states aggregate accuracy and F1 but does not clarify how severity diagnosis is derived from event detections (e.g., AHI thresholds or per-recording aggregation).
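
For readers unfamiliar with the step this comment asks about, here is a minimal sketch of one common route from event detections to a severity label. It assumes severity is read off the apnea-hypopnea index (AHI, events per hour of sleep) using the widely used 5/15/30 cut-offs; the abstract does not say which rule the paper applies, and the function names are hypothetical.

```python
def ahi(num_events: int, sleep_hours: float) -> float:
    """Apnea-hypopnea index: scored events per hour of sleep."""
    if sleep_hours <= 0:
        raise ValueError("sleep_hours must be positive")
    return num_events / sleep_hours


def severity(ahi_value: float) -> str:
    """Map an AHI to a class with the common 5/15/30 cut-offs (an assumption)."""
    if ahi_value < 5:
        return "normal"
    if ahi_value < 15:
        return "mild"
    if ahi_value < 30:
        return "moderate"
    return "severe"


def severity_accuracy(predicted, reference) -> float:
    """Fraction of recordings whose predicted class equals the reference class."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Invented example: 84 events over 7 hours of sleep gives an AHI of 12.
example_class = severity(ahi(84, 7.0))
```

Because the label depends only on which side of a cut-off the AHI falls, two scorers can disagree on many individual events and still agree on severity, or differ by a handful of events near a threshold and disagree on it.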

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important points for improving the clarity and statistical rigor of the manuscript. We address each major comment below.

Referee: [Abstract] Abstract: The reported performance numbers for the automatic approach (81% accuracy, 0.57 F1) are given without any description of the training procedure, validation splits, hyperparameter selection, or whether the 52 recordings were used only for testing, in cross-validation, or with possible overlap from prior DOSED training. This information is load-bearing for the central claim that the method reaches expert level.

Authors: We agree that the abstract should explicitly summarize the evaluation protocol to support the central claim. The 52 recordings constitute a held-out test set with no overlap from the data used to develop or tune DOSED; full details of the training procedure, cross-validation, and hyperparameter selection appear in the Methods section. We will revise the abstract to include a concise statement of this protocol. revision: yes

Referee: [Abstract] Abstract/Results: No statistical significance tests, confidence intervals, or variability estimates are provided for the differences between AI and expert performance (6 percentage points in accuracy, 0.02 in F1) on the small set of 52 recordings scored by 5 experts. Given known inter-scorer variability, it is unclear whether the modest edge is reliable or could arise from chance or scorer differences.

Authors: We acknowledge the absence of statistical tests or confidence intervals in the current version. We will add bootstrap confidence intervals around the performance metrics and a paired statistical comparison (e.g., permutation test) between the automatic method and the expert panel to assess whether the observed differences are robust given the sample size and inter-scorer variability. revision: yes
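
The two analyses named in this response can be sketched with the standard library alone. This is an illustrative sketch, not the authors' code: the per-recording outcomes below are invented placeholders chosen only to reproduce 42/52 (about 81%) and 39/52 (75%), and the expert panel is collapsed into a single outcome vector, whereas the paper reports an average over five scorers. Function names and defaults are hypothetical.

```python
import random


def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy, resampling recordings."""
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def paired_permutation_p(a, b, n_perm=5000, seed=0):
    """Two-sided sign-flip permutation test on paired per-recording outcomes."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, swapping the two raters on a recording is equally likely.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Placeholder data: 1 = severity class correct on that recording, 0 = wrong.
model_correct = [1] * 42 + [0] * 10             # 42/52, about 81%
expert_correct = [1] * 36 + [0] * 6 + [1] * 3 + [0] * 7  # 39/52 = 75%

ci_low, ci_high = bootstrap_ci(model_correct)
p_value = paired_permutation_p(model_correct, expert_correct)
```

With these placeholder outcomes the two raters differ on only nine recordings, so the test has little power to separate them. The real answer depends on the actual per-recording results, which are not available here.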

Circularity Check

0 steps flagged

No significant circularity; empirical comparison uses independent expert scorings as reference standard

The paper's central claim rests on a direct empirical comparison of the adapted DOSED model against five human experts' scorings on the same 52 PSG recordings, with performance metrics (accuracy for severity diagnosis, F1 for event detection) computed against those independent human labels. No equations, fitted parameters, or self-citations are shown to define the target metrics in terms of the model's own outputs or prior results; the evaluation protocol treats expert annotations as an external reference. Self-citation to the original DOSED work is present but not load-bearing for the reported performance numbers or the 'expert level' conclusion, which derives from the new dataset comparison rather than reducing to prior definitions by construction. This matches the default expectation of no circularity for an empirical methods paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper adapts a prior deep learning model to a new clinical task using standard supervised learning assumptions on a small labeled dataset; no explicit free parameters beyond typical neural network weights are described.

free parameters (1)

DOSED model parameters
Neural network weights and hyperparameters fitted during adaptation to the 52 PSG recordings.

axioms (1)

domain assumption: The scorings by the five experts constitute a reliable benchmark for human-level performance.
Used directly as the comparison baseline for both severity diagnosis and event detection.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We adapted our state-of-the-art deep learning method for sleep event detection, DOSED, to the detection of sleep breathing events in PSG... leave-one-out cross-validation... F1 score... 0.55 for experts and 0.57 for the automatic approach

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: The convolutional network architecture... feature extraction blocks... localization module and classification module

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1] Siddharth Biswal et al. “Expert-level sleep scoring with deep neural networks”. In: Journal of the American Medical Informatics Association (2018)

[2] Stanislas Chambon et al. “A Deep Learning Architecture to Detect Events in EEG Signals During Sleep”. In: 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP). Sept. 2018, pp. 1–6

[3] Stanislas Chambon et al. “DOSED: a deep learning approach to detect multiple sleep micro-events in EEG signal”. In: arXiv preprint arXiv:1812.04079 (2018) [internal anchor]

[4] Maowei Cheng et al. “Recurrent neural network based classification of ecg signal features for obstruction of sleep apnea detection”. In: Computational Science and Engineering (CSE) and Embedded and Ubiquitous Computing (EUC), 2017 IEEE International Conference on. Vol. 2. IEEE. 2017, pp. 199–202

[5] Sang Ho Choi et al. “Real-time apnea-hypopnea event detection during sleep by convolutional neural networks”. In: Computers in biology and medicine 100 (2018), pp. 123–131

[6] Antoine Honore et al. “Large Neural Network Based Detection of Apnea, Bradycardia and Desaturation Events”. In: arXiv preprint arXiv:1711.06484 (2017) [internal anchor]

[7] Wu Huang et al. “A novel method to precisely detect apnea and hypopnea events by airflow and oximetry signals”. In: Computers in biology and medicine 88 (2017), pp. 32–40

[8] Ahsan H Khandoker et al. “Support vector machines for automated recognition of obstructive sleep apnea syndrome from ECG recordings”. In: IEEE transactions on information technology in biomedicine 13.1 (2009), pp. 37–48

[9] Adam Paszke et al. “Automatic differentiation in PyTorch”. In: NIPS-W. 2017

[10] Paul E Peppard et al. “Increased prevalence of sleep-disordered breathing in adults”. In: American journal of epidemiology 177.9 (2013), pp. 1006–1014

[11] Joseph Redmon et al. “You only look once: Unified, real-time object detection”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 779–788

[12] Richard S. Rosenberg and Steven Van Hout. “The American Academy of Sleep Medicine Inter-scorer Reliability program: respiratory events”. In: J Clin Sleep Med 10.4 (July 2014), pp. 447–454

[13] American Academy of Sleep Medicine (AASM). “Hidden Health Crisis Costing America Billions. Underdiagnosing and Undertreating Obstructive Sleep Apnea Draining Healthcare System”. In: (2016)

[14] Changyue Song et al. “An obstructive sleep apnea detection approach using a discriminative hidden Markov model from ECG signals”. In: IEEE Transactions on Biomedical Engineering 63.7 (2016), pp. 1532–1542

[15] B. Xie and H. Minn. “Real-Time Sleep Apnea Detection by Classifier Combination”. In: IEEE Transactions on Information Technology in Biomedicine 16.3 (May 2012), pp. 469–477
