AI vs Humans for the diagnosis of sleep apnea
Pith reviewed 2026-05-25 19:30 UTC · model grok-4.3
The pith
An adapted deep learning method reaches 81% accuracy in diagnosing sleep apnea severity, compared with a 75% average for human experts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We adapted our state-of-the-art deep learning method for sleep event detection, DOSED, to the detection of sleep breathing events in PSG for the diagnosis of OSA. We used a dataset of 52 PSG recordings with apnea-hypopnea event scoring from 5 trained sleep experts. We observed that human sleep experts reached an average accuracy of 75% while the automatic approach reached 81% for sleep apnea severity diagnosis. The F1 score for individual event detection was 0.55 for experts and 0.57 for the automatic approach, on average. These results demonstrate that the automatic approach can perform at a sleep expert level for the diagnosis of OSA.
What carries the argument
The DOSED deep learning model adapted to detect apnea and hypopnea events in polysomnography signals and classify OSA severity.
If this is right
- Automatic methods can classify OSA severity at least as accurately as human experts on the tested data.
- Individual apnea-hypopnea event detection reaches F1 scores comparable to those of experts.
- Deep learning can be applied to reduce the time and variability of manual PSG scoring.
- The approach supports replacing or assisting experts for routine OSA diagnosis tasks.
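The event-level F1 comparison in the list above can be made concrete with a minimal sketch. It assumes a predicted event counts as a true positive when its intersection-over-union (IoU) with a not-yet-matched reference event reaches a threshold. The 0.3 threshold, the greedy one-to-one matching, and the names `iou` and `event_f1` are illustrative assumptions, not details confirmed by the paper.

```python
from typing import List, Tuple

Event = Tuple[float, float]  # (start_seconds, end_seconds)


def iou(a: Event, b: Event) -> float:
    """Temporal intersection-over-union of two events."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def event_f1(predicted: List[Event], reference: List[Event],
             min_iou: float = 0.3) -> float:
    """F1 score with greedy one-to-one matching (assumed protocol)."""
    unmatched = list(reference)
    tp = 0
    for p in predicted:
        # Pick the remaining reference event that overlaps this prediction most.
        best = max(unmatched, key=lambda r: iou(p, r), default=None)
        if best is not None and iou(p, best) >= min_iou:
            tp += 1
            unmatched.remove(best)
    fp = len(predicted) - tp
    fn = len(reference) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical events, in seconds, for illustration only.
reference_events = [(10.0, 30.0), (100.0, 120.0), (200.0, 215.0)]
predicted_events = [(12.0, 28.0), (105.0, 130.0), (300.0, 310.0)]
f1_example = event_f1(predicted_events, reference_events)
```

In this toy case two predictions match and one does not, so there are two true positives, one false positive, and one false negative. A stricter `min_iou` would lower F1 for scorers and model alike, which is why the matching rule matters when comparing a 0.55 and a 0.57.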
Where Pith is reading between the lines
- Routine use of such detectors could shorten the time between sleep study and diagnosis in busy clinics.
- Combining the automatic output with selective expert review might raise overall reliability further.
- Testing the same method on recordings from different equipment or patient groups would check whether the expert-level result holds more widely.
Load-bearing premise
Performance measured on this single collection of 52 recordings scored by five experts is enough to conclude that the automatic method reaches expert level in general.
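A rough worked calculation shows why the sample size matters. It assumes, purely for illustration, that the reported 81% is a proportion of correctly classified recordings among 52 independent recordings, and it uses the normal approximation to the binomial; the paper does not report this interval.

```latex
\hat{p} = 0.81, \qquad n = 52
\\
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
                     = \sqrt{\frac{0.81 \times 0.19}{52}} \approx 0.054
\\
95\%\ \text{CI} \approx \hat{p} \pm 1.96\,\mathrm{SE}(\hat{p})
                = 0.81 \pm 0.107 \approx [0.70,\ 0.92]
```

Under these assumptions the interval contains the 75% expert average, so the 6-point gap alone does not separate the two.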
What would settle it
An independent test on a new set of recordings from multiple centers where the automatic method scores below 75% accuracy or shows statistically worse event detection than the expert average would disprove the claim.
Original abstract
Polysomnography (PSG) is the gold standard for diagnosing obstructive sleep apnea (OSA). It allows monitoring of breathing events throughout the night. The detection of these events is usually done by trained sleep experts. However, this task is tedious, highly time-consuming and subject to substantial inter-scorer variability. In this study, we adapted our state-of-the-art deep learning method for sleep event detection, DOSED, to the detection of sleep breathing events in PSG for the diagnosis of OSA. We used a dataset of 52 PSG recordings with apnea-hypopnea event scoring from 5 trained sleep experts. We assessed the performance of the automatic approach and compared it to the inter-scorer performance for both the diagnosis of OSA severity and, at the microscale, for the detection of single breathing events. We observed that human sleep experts reached an average accuracy of 75% while the automatic approach reached 81% for sleep apnea severity diagnosis. The F1 score for individual event detection was 0.55 for experts and 0.57 for the automatic approach, on average. These results demonstrate that the automatic approach can perform at a sleep expert level for the diagnosis of OSA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper adapts the DOSED deep learning model for detecting sleep breathing events in polysomnography (PSG) recordings and compares its performance to that of five human sleep experts on a dataset of 52 recordings. It reports that the automatic method achieves 81% accuracy for OSA severity diagnosis (vs. 75% average for experts) and 0.57 average F1 for individual event detection (vs. 0.55 for experts), concluding that the automatic approach performs at expert level.
Significance. If the evaluation protocol is sound and the small observed differences are robust, the work would provide evidence that deep learning can match inter-expert agreement levels in OSA diagnosis, which is clinically relevant given known scorer variability. The use of multiple independent expert scorings as the reference standard is a positive design choice that avoids circularity.
major comments (2)
- [Abstract] Abstract: The reported performance numbers for the automatic approach (81% accuracy, 0.57 F1) are given without any description of the training procedure, validation splits, hyperparameter selection, or whether the 52 recordings were used only for testing, in cross-validation, or with possible overlap from prior DOSED training. This information is load-bearing for the central claim that the method reaches expert level.
- [Abstract] Abstract/Results: No statistical significance tests, confidence intervals, or variability estimates are provided for the differences between AI and expert performance (6 percentage points in accuracy, 0.02 in F1) on the small set of 52 recordings scored by 5 experts. Given known inter-scorer variability, it is unclear whether the modest edge is reliable or could arise from chance or scorer differences.
minor comments (1)
- [Abstract] The abstract states aggregate accuracy and F1 but does not clarify how severity diagnosis is derived from event detections (e.g., AHI thresholds or per-recording aggregation).
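One plausible derivation, sketched here as an assumption rather than the paper's documented procedure, computes the apnea-hypopnea index (AHI) as events per hour of sleep and maps it to a severity class with the conventional clinical cut-offs of 5, 15, and 30 events per hour. The function names `ahi` and `osa_severity` are hypothetical.

```python
def ahi(n_events: int, total_sleep_time_h: float) -> float:
    """Apnea-hypopnea index: scored events per hour of sleep."""
    if total_sleep_time_h <= 0:
        raise ValueError("total_sleep_time_h must be positive")
    return n_events / total_sleep_time_h


def osa_severity(ahi_value: float) -> str:
    """Map an AHI to a severity class using the conventional cut-offs.

    Assumption: the paper may use different classes or thresholds.
    """
    if ahi_value < 5:
        return "normal"
    if ahi_value < 15:
        return "mild"
    if ahi_value < 30:
        return "moderate"
    return "severe"


# Hypothetical recording: 84 scored events over 7 hours of sleep.
example_class = osa_severity(ahi(84, 7.0))
```

Because the class depends only on which side of a threshold the AHI falls, two scorers can disagree on many individual events and still agree on severity, or disagree on severity after a small difference near a cut-off.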
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important points for improving the clarity and statistical rigor of the manuscript. We address each major comment below.
- Referee: [Abstract] Abstract: The reported performance numbers for the automatic approach (81% accuracy, 0.57 F1) are given without any description of the training procedure, validation splits, hyperparameter selection, or whether the 52 recordings were used only for testing, in cross-validation, or with possible overlap from prior DOSED training. This information is load-bearing for the central claim that the method reaches expert level.
  Authors: We agree that the abstract should explicitly summarize the evaluation protocol to support the central claim. The 52 recordings constitute a held-out test set with no overlap from the data used to develop or tune DOSED; full details of the training procedure, cross-validation, and hyperparameter selection appear in the Methods section. We will revise the abstract to include a concise statement of this protocol.
  Revision: yes
- Referee: [Abstract] Abstract/Results: No statistical significance tests, confidence intervals, or variability estimates are provided for the differences between AI and expert performance (6 percentage points in accuracy, 0.02 in F1) on the small set of 52 recordings scored by 5 experts. Given known inter-scorer variability, it is unclear whether the modest edge is reliable or could arise from chance or scorer differences.
  Authors: We acknowledge the absence of statistical tests or confidence intervals in the current version. We will add bootstrap confidence intervals around the performance metrics and a paired statistical comparison (e.g., permutation test) between the automatic method and the expert panel to assess whether the observed differences are robust given the sample size and inter-scorer variability.
  Revision: yes
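The two analyses named in this response can be sketched with the standard library alone. The code below is a minimal illustration, not the authors' analysis: it computes a percentile bootstrap interval over per-recording correctness flags and a paired sign-flip permutation test on the per-recording differences. The flags are synthetic placeholders chosen only to resemble the reported rates, their pairing across recordings is arbitrary, and the names `bootstrap_ci` and `paired_permutation_p` are hypothetical.

```python
import random
from typing import List, Tuple


def bootstrap_ci(correct: List[int], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy, resampling recordings."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def paired_permutation_p(a: List[int], b: List[int],
                         n_perm: int = 2000, seed: int = 0) -> float:
    """Two-sided paired permutation test via random sign flips."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to flip sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Synthetic placeholder flags (1 = severity class correct), NOT the paper's data.
model_correct = [1] * 42 + [0] * 10
expert_correct = [1] * 39 + [0] * 13

model_ci = bootstrap_ci(model_correct)
p_value = paired_permutation_p(model_correct, expert_correct)
```

Resampling whole recordings keeps each patient's events together, which matters because events within one night are not independent. With five experts, the comparison could be run against each scorer separately or against their pooled accuracy; which one the authors intend is not stated.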
Circularity Check
No significant circularity; empirical comparison uses independent expert scorings as reference standard
The paper's central claim rests on a direct empirical comparison of the adapted DOSED model against five human experts' scorings on the same 52 PSG recordings, with performance metrics (accuracy for severity diagnosis, F1 for event detection) computed against those independent human labels. No equations, fitted parameters, or self-citations are shown to define the target metrics in terms of the model's own outputs or prior results; the evaluation protocol treats expert annotations as an external reference. Self-citation to the original DOSED work is present but not load-bearing for the reported performance numbers or the 'expert level' conclusion, which derives from the new dataset comparison rather than reducing to prior definitions by construction. This matches the default expectation of no circularity for an empirical methods paper.
Axiom & Free-Parameter Ledger
free parameters (1)
- DOSED model parameters
axioms (1)
- domain assumption: The scorings by the five experts constitute a reliable benchmark for human-level performance.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  We adapted our state-of-the-art deep learning method for sleep event detection, DOSED, to the detection of sleep breathing events in PSG... leave-one-out cross-validation... F1 score... 0.55 for experts and 0.57 for the automatic approach
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  The convolutional network architecture... feature extraction blocks... localization module and classification module
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Siddharth Biswal et al. “Expert-level sleep scoring with deep neural networks”. In: Journal of the American Medical Informatics Association (2018)
- [2] Stanislas Chambon et al. “A Deep Learning Architecture to Detect Events in EEG Signals During Sleep”. In: 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP). Sept. 2018, pp. 1–6
- [3] Stanislas Chambon et al. “DOSED: a deep learning approach to detect multiple sleep micro-events in EEG signal”. In: arXiv preprint arXiv:1812.04079 (2018)
- [4] Maowei Cheng et al. “Recurrent neural network based classification of ecg signal features for obstruction of sleep apnea detection”. In: Computational Science and Engineering (CSE) and Embedded and Ubiquitous Computing (EUC), 2017 IEEE International Conference on. Vol. 2. IEEE. 2017, pp. 199–202
- [5] Sang Ho Choi et al. “Real-time apnea-hypopnea event detection during sleep by convolutional neural networks”. In: Computers in biology and medicine 100 (2018), pp. 123–131
- [6] Antoine Honore et al. “Large Neural Network Based Detection of Apnea, Bradycardia and Desaturation Events”. In: arXiv preprint arXiv:1711.06484 (2017)
- [7] Wu Huang et al. “A novel method to precisely detect apnea and hypopnea events by airflow and oximetry signals”. In: Computers in biology and medicine 88 (2017), pp. 32–40
- [8] Ahsan H Khandoker et al. “Support vector machines for automated recognition of obstructive sleep apnea syndrome from ECG recordings”. In: IEEE transactions on information technology in biomedicine 13.1 (2009), pp. 37–48
- [9] Adam Paszke et al. “Automatic differentiation in PyTorch”. In: NIPS-W. 2017
- [10] Paul E Peppard et al. “Increased prevalence of sleep-disordered breathing in adults”. In: American journal of epidemiology 177.9 (2013), pp. 1006–1014
- [11] Joseph Redmon et al. “You only look once: Unified, real-time object detection”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 779–788
- [12] Richard S. Rosenberg and Steven Van Hout. “The American Academy of Sleep Medicine Inter-scorer Reliability program: respiratory events”. In: J Clin Sleep Med 10.4 (July 2014), pp. 447–454
- [13] American Academy of Sleep Medicine (AASM). “Hidden Health Crisis Costing America Billions. Underdiagnosing and Undertreating Obstructive Sleep Apnea Draining Healthcare System”. In: (2016)
- [14] Changyue Song et al. “An obstructive sleep apnea detection approach using a discriminative hidden Markov model from ECG signals”. In: IEEE Transactions on Biomedical Engineering 63.7 (2016), pp. 1532–1542
- [15] B. Xie and H. Minn. “Real-Time Sleep Apnea Detection by Classifier Combination”. In: IEEE Transactions on Information Technology in Biomedicine 16.3 (May 2012), pp. 469–477