
arxiv: 1907.05164 · v1 · pith:RZPSSZGZnew · submitted 2019-07-11 · 📡 eess.IV · cs.CV · cs.LG

Disease classification of macular Optical Coherence Tomography scans using deep learning software: validation on independent, multi-centre data

Pith reviewed 2026-05-24 23:08 UTC · model grok-4.3

classification 📡 eess.IV · cs.CV · cs.LG
keywords: optical coherence tomography, deep learning, retinal disease, AMD, DME, multi-centre validation, clinical decision support

The pith

Pegasus-OCT detects macular anomalies with at least 98% AUROC across independent multi-centre OCT datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates Pegasus-OCT, deep learning clinical decision support software, on 5,588 normal and anomalous macular OCT volumes collected from independent centres in five countries. The software processes the scans, and its outputs are compared against ground-truth labels supplied by the dataset owners. It achieves AUROCs of at least 98% for general macular anomalies on every dataset and, on scans of sufficient quality, at least 99% for AMD and at least 98% for DME. A sympathetic reader would care because consistently high performance across varied demographics, device manufacturers, sites and operators indicates the tool could operate reliably outside its original training environment.

Core claim

Pegasus-OCT performed with AUROCs of at least 98% for all datasets in the detection of general macular anomalies. For scans of sufficient quality, the AUROCs for general AMD and DME detection were found to be at least 99% and 98%, respectively.
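The "at least 98% for all datasets" phrasing is a worst-case statement over per-dataset AUROCs. The sketch below illustrates that reading with a rank-based (Mann-Whitney) AUROC in plain Python. It is not the authors' evaluation code: the function names, the 0/1 label encoding, and the dataset dictionary layout are assumptions made for illustration.

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic.

    labels: 0 (normal) or 1 (anomalous) per volume (assumed encoding).
    scores: model outputs, higher meaning more likely anomalous.
    Tied positive/negative scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def worst_case_auroc(datasets):
    """Return (name, AUROC) for the weakest dataset.

    datasets: dict mapping a hypothetical dataset name to (labels, scores).
    """
    per_dataset = {name: auroc(y, s) for name, (y, s) in datasets.items()}
    worst = min(per_dataset, key=per_dataset.get)
    return worst, per_dataset[worst]
```

A claim of the form "AUROC at least 98% for all datasets" then corresponds to `worst_case_auroc(datasets)[1] >= 0.98`. The pairwise loop is quadratic in the number of volumes, which is acceptable for a sketch but not for large evaluations.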

What carries the argument

Pegasus-OCT is deep learning software that identifies features of retinal disease from macular OCT scans; the paper tests its performance across heterogeneous populations.

If this is right

  • The software maintains performance when applied to data from different patient demographics and device manufacturers.
  • High detection rates hold for scans acquired at multiple independent sites by different operators.
  • The results support potential use of the software to help manage growing demand in eye care services for retinal disease.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Validation on external multi-centre data increases the chance the model will work in new clinics that use different OCT machines.
  • If performance remains stable, the software could reduce variability in initial screening for AMD and DME across regions.
  • Further tests could measure whether the high AUROCs translate into faster referral decisions in routine practice.

Load-bearing premise

Ground truth labels supplied by the dataset owners are accurate, consistent, and free of systematic bias across centers, devices, and operators.

What would settle it

Independent re-labelling of a random subset of the scans by a new panel of experts that produces labels differing on more than 10% of cases and drops the reported AUROCs below 90%.
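The criterion above combines two thresholds: label disagreement above 10% and an AUROC below 90% when scored against the new labels. A minimal sketch of that check follows. The function names are hypothetical, and the AUROC against the re-labelled subset is assumed to have been computed separately and passed in as a number.

```python
def disagreement_rate(original, relabelled):
    """Fraction of cases where the new panel's label differs from the original."""
    if not original or len(original) != len(relabelled):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a != b for a, b in zip(original, relabelled)) / len(original)


def claim_undermined(original, relabelled, auroc_on_relabelled,
                     max_disagreement=0.10, min_auroc=0.90):
    """True only if both thresholds stated in the text are crossed."""
    too_much_disagreement = disagreement_rate(original, relabelled) > max_disagreement
    auroc_too_low = auroc_on_relabelled < min_auroc
    return too_much_disagreement and auroc_too_low
```

Both conditions are required, following the "and" in the stated criterion; a high disagreement rate with an unchanged AUROC would not count as settling the question under this reading.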

Original abstract

Purpose: To evaluate Pegasus-OCT, a clinical decision support software for the identification of features of retinal disease from macula OCT scans, across heterogenous populations involving varying patient demographics, device manufacturers, acquisition sites and operators. Methods: 5,588 normal and anomalous macular OCT volumes (162,721 B-scans), acquired at independent centres in five countries, were processed using the software. Results were evaluated against ground truth provided by the dataset owners. Results: Pegasus-OCT performed with AUROCs of at least 98% for all datasets in the detection of general macular anomalies. For scans of sufficient quality, the AUROCs for general AMD and DME detection were found to be at least 99% and 98%, respectively. Conclusions: The ability of a clinical decision support system to cater for different populations is key to its adoption. Pegasus-OCT was shown to be able to detect AMD, DME and general anomalies in OCT volumes acquired across multiple independent sites with high performance. Its use thus offers substantial promise, with the potential to alleviate the burden of growing demand in eye care services caused by retinal disease.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript evaluates Pegasus-OCT, a deep learning clinical decision support tool, for detecting general macular anomalies, AMD, and DME in 5,588 OCT volumes (162,721 B-scans) acquired at independent centres in five countries using varied devices and operators. Performance is assessed via AUROC against ground-truth labels supplied by the dataset owners, yielding AUROCs ≥98% for general anomalies on all datasets and ≥99% (AMD) / ≥98% (DME) on quality-filtered scans.

Significance. A large-scale, multi-center, multi-device validation study is a strength for assessing real-world robustness of OCT classification software, which could support clinical adoption if the reported metrics are shown to reflect true generalization. The scale (five countries) addresses an important practical need in retinal disease screening.

major comments (3)
  1. [Methods] Methods: No information is supplied on the composition or provenance of the training data used to develop Pegasus-OCT, nor on any steps taken to exclude overlap with the five validation datasets. This detail is load-bearing for the central claim of robust performance on 'independent' multi-centre data.
  2. [Methods] Methods: The evaluation relies entirely on ground-truth labels supplied by the five dataset owners, yet the text provides no evidence of a unified labeling protocol, inter-rater reliability statistics, or any post-hoc audit of label consistency across centers, devices, or operators. Because all AUROCs are computed directly against these labels, systematic inter-center labeling differences could inflate or deflate the reported figures without reflecting model behavior.
  3. [Abstract] Abstract and Results: No confidence intervals, standard errors, or other measures of statistical uncertainty are reported for any AUROC value, and the quality-filtering criteria used to define the 'sufficient quality' subset are not described. Both omissions prevent assessment of the precision and scope of the headline performance claims.
minor comments (1)
  1. [Abstract] The abstract states results for 'general AMD and DME detection' but does not clarify whether these are binary detection tasks or multi-class; a brief clarification in the Methods would improve readability.
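Major comment 2 asks for inter-rater reliability statistics. One common choice is Cohen's kappa, which corrects raw agreement between two graders for the agreement expected by chance. The sketch below shows the calculation; the paper reports no such statistic, and the function name and label values are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("rating lists must be non-empty and of equal length")
    # Observed agreement: fraction of cases given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, two graders who agree on 4 of 6 volumes, with a chance agreement of 1/3, obtain a kappa of 0.5.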

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We provide point-by-point responses to the major comments below, indicating where revisions will be made.

Point-by-point responses
  1. Referee: [Methods] Methods: No information is supplied on the composition or provenance of the training data used to develop Pegasus-OCT, nor on any steps taken to exclude overlap with the five validation datasets. This detail is load-bearing for the central claim of robust performance on 'independent' multi-centre data.

    Authors: Pegasus-OCT is a proprietary clinical decision support tool developed using training data collected from clinical sites distinct from the five validation centers described in this study. The validation datasets were acquired independently at centers in five countries with no participation in the model's development. We will revise the Methods section to include this information on the independence of the validation data. revision: yes

  2. Referee: [Methods] Methods: The evaluation relies entirely on ground-truth labels supplied by the five dataset owners, yet the text provides no evidence of a unified labeling protocol, inter-rater reliability statistics, or any post-hoc audit of label consistency across centers, devices, or operators. Because all AUROCs are computed directly against these labels, systematic inter-center labeling differences could inflate or deflate the reported figures without reflecting model behavior.

    Authors: Each dataset owner supplied ground-truth labels according to their own clinical protocols and standards. As this study utilizes pre-existing datasets for external validation, inter-rater reliability data were not available to the authors. We will add a statement in the Methods section to clarify that labels were used as provided by the dataset owners without additional auditing. revision: partial

  3. Referee: [Abstract] Abstract and Results: No confidence intervals, standard errors, or other measures of statistical uncertainty are reported for any AUROC value, and the quality-filtering criteria used to define the 'sufficient quality' subset are not described. Both omissions prevent assessment of the precision and scope of the headline performance claims.

    Authors: We agree with the need for statistical uncertainty measures and a description of quality criteria. We will add 95% confidence intervals for all reported AUROCs, computed using bootstrap resampling. We will also describe the quality-filtering criteria applied to define the sufficient quality subset in the Methods section. revision: yes
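The third response promises 95% confidence intervals from bootstrap resampling without giving details. The sketch below shows one way this could be done, a percentile bootstrap over volumes. The percentile method, the resampling unit, the number of resamples, and all function names are assumptions rather than the authors' stated procedure.

```python
import random


def auroc(labels, scores):
    """Rank-based AUROC; labels are 0/1, ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(labels, scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUROC."""
    n = len(labels)
    if not 0 < sum(labels) < n:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one of the classes
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[round((alpha / 2) * (n_resamples - 1))]
    hi = stats[round((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi
```

Resampling individual volumes treats them as independent. If several volumes come from the same patient, resampling at the patient level would be the more cautious choice, since volume-level resampling can make the interval too narrow.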

Circularity Check

0 steps flagged

No circularity: empirical validation without derivation chain

Full rationale

The paper is a straightforward empirical validation study that processes 5,588 OCT volumes with existing Pegasus-OCT software and reports observed AUROCs against ground-truth labels supplied by the dataset owners. No equations, parameter fitting, ansatzes, uniqueness theorems, or self-citations appear in the abstract or described methods; the reported performance figures are direct measurements on held-out data rather than quantities derived from or reduced to the paper's own inputs by construction. The central claim therefore contains independent empirical content and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on the unverified accuracy of external ground-truth labels and on the assumption that the test volumes are fully independent of any data used to develop the software.

axioms (1)
  • domain assumption: Ground-truth labels provided by dataset owners are accurate and unbiased across all centers and devices.
    AUROC calculations are computed directly against these labels; any systematic labeling error would invalidate the reported performance figures.

pith-pipeline@v0.9.0 · 5765 in / 1250 out tokens · 21281 ms · 2026-05-24T23:08:00.158023+00:00 · methodology

