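Validating a Deep Learning Algorithm to Identify Patients with Glaucoma using Systemic Electronic Health Records
John Xiang, Rohith Ravindranath, Sophia Y. Wang (Department of Ophthalmology, Byers Eye Institute, Stanford University)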
Pith reviewed 2026-05-10 01:36 UTC · model grok-4.3
The pith
A deep learning model trained on national data and fine-tuned locally identifies glaucoma patients from systemic electronic health records with AUROC 0.883.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A pretrained glaucoma risk assessment model, fine-tuned on institutional electronic health record data, achieves an AUROC of 0.883 and a positive predictive value of 0.657 for identifying patients with glaucoma from systemic records alone. Its predictions track observed outcomes: in the highest prediction decile, 65.7 percent of patients have a glaucoma diagnosis and 57.0 percent receive glaucoma treatment.
What carries the argument
The argument rests on a glaucoma risk assessment (GRA) deep learning model that ingests systemic EHR features (demographics, diagnoses, medications, laboratory results, and physical examination measurements) and outputs a glaucoma probability score.
If this is right
- Performance improves when more layers are made trainable, up to 15 layers, and with larger amounts of local training data.
- The highest-risk patients identified by the model have substantially elevated rates of glaucoma diagnosis and treatment in real clinical records.
- An EHR-only approach removes the need for specialized ocular imaging or eye-specific exams during initial risk assessment.
- The model supports scalable pre-screening that could be applied to large patient populations during routine care.
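The first point above describes fine-tuning with a growing number of trainable layers. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a hypothetical model represented as an ordered list of layers, each with a `trainable` flag, and assumes that the layers nearest the output are unfrozen first, which is a common convention but is not stated in the text. The names `Layer` and `unfreeze_last_k` and the depth of 20 layers are illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    """Hypothetical stand-in for one network layer."""
    name: str
    trainable: bool = False


def unfreeze_last_k(layers: List[Layer], k: int) -> List[Layer]:
    """Freeze every layer, then mark only the last k as trainable."""
    if k < 0:
        raise ValueError("k must be non-negative")
    k = min(k, len(layers))
    cutoff = len(layers) - k
    for i, layer in enumerate(layers):
        layer.trainable = i >= cutoff
    return layers


# Assumed depth of 20 layers, purely for illustration.
model = [Layer(f"layer_{i}") for i in range(20)]
unfreeze_last_k(model, 15)
trainable_names = [layer.name for layer in model if layer.trainable]
print(len(trainable_names), trainable_names[0])
```

In a real framework the same pattern would toggle whether each layer's parameters receive gradient updates; sweeping `k` and recording held-out performance would produce the kind of ablation the claim refers to.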
Where Pith is reading between the lines
- Integration into general electronic health record systems could automatically flag at-risk patients during non-eye medical visits.
- Wider use might improve early detection rates for glaucoma, which is often asymptomatic until late stages.
- Further tests at additional sites with minimal or no fine-tuning would reveal how much site-specific adaptation is truly required for consistent results.
Load-bearing premise
Recorded electronic health record diagnoses of glaucoma accurately reflect true disease presence, and the model will perform similarly at other institutions without site-specific fine-tuning.
What would settle it
Reliable transfer would be falsified if the model, deployed at another independent health system, showed an AUROC substantially below 0.8 or poor calibration, with the top risk decile not showing elevated glaucoma diagnosis rates.
Original abstract
We evaluated whether a glaucoma risk assessment (GRA) model trained on All of Us national data can identify patients at high probability of glaucoma using only systemic electronic health records (EHR) at an independent institution. In this cross-sectional study, 20,636 Stanford patients seen from November 2013 to January 2024 were included (15% with glaucoma). A pretrained GRA model was fine-tuned on the Stanford cohort and tested on a held-out set using demographics, systemic diagnoses, medications, laboratory results, and physical examination measurements as inputs. The best model achieved AUROC 0.883 and PPV 0.657. Calibration was consistent with clinical risk: the highest prediction decile showed the greatest glaucoma diagnosis rate (65.7%) and treatment rate (57.0%). Performance improved with more trainable layers up to 15 and with additional data. An EHR-only GRA model may enable scalable and accessible pre-screening without specialized imaging.
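The abstract reports three kinds of evidence: discrimination (AUROC), PPV, and the observed diagnosis rate by prediction decile. The sketch below shows how each can be computed from scores and binary labels. It runs on a synthetic cohort with roughly 15% prevalence; the scores are simulated and are not model output, and the function names are illustrative. The reported PPV of 0.657 matches the 65.7% diagnosis rate in the top decile, which would be consistent with an operating threshold at the top 10% of scores, though the text does not state the threshold.

```python
import random


def auroc(scores, labels):
    """Rank-based AUROC: chance a random positive outscores a random negative (ties count half)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank across tied scores
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def decile_rates(scores, labels):
    """Observed positive rate in each score decile, lowest scores first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    rates = []
    for d in range(10):
        idx = order[d * n // 10:(d + 1) * n // 10]
        rates.append(sum(labels[i] for i in idx) / len(idx))
    return rates


# Synthetic cohort with about 15% prevalence; scores are simulated for illustration.
rng = random.Random(0)
labels = [1 if rng.random() < 0.15 else 0 for _ in range(5000)]
scores = [rng.gauss(1.5 if y else 0.0, 1.0) for y in labels]

rates = decile_rates(scores, labels)
print(f"AUROC: {auroc(scores, labels):.3f}")
print(f"Top-decile observed rate (PPV when flagging the top 10%): {rates[-1]:.3f}")
```

A calibration check in the abstract's sense would compare `rates` across deciles and confirm that the observed rate rises with the predicted score.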
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates a deep learning glaucoma risk assessment model pretrained on the All of Us national dataset and fine-tuned on systemic EHR data (demographics, diagnoses, medications, labs, physical exams) from Stanford. In a Stanford cohort of 20,636 patients (15% with glaucoma), the best model achieves AUROC 0.883 and PPV 0.657 on a held-out test set. Calibration is reported as clinically consistent, with the top prediction decile showing a 65.7% glaucoma diagnosis rate and a 57.0% treatment rate. Performance improves with more trainable layers (up to 15) and additional data. The authors conclude that an EHR-only model could support scalable pre-screening without imaging.
Significance. If the central claims hold after the label issues are addressed, the work provides evidence that a model pretrained on a large national cohort can transfer to institutional EHR data for predicting glaucoma from systemic features. Its strengths are pretraining on an external dataset, performance gains from progressive fine-tuning, and a decile-based calibration analysis. This could support accessible glaucoma pre-screening in non-specialist settings.
major comments (3)
- Abstract and Results: The headline metrics (AUROC 0.883, PPV 0.657, 65.7% diagnosis rate in top decile) are computed against Stanford EHR-recorded glaucoma diagnoses as ground truth. Because glaucoma is known to be underdiagnosed in non-ophthalmic EHR and ICD-based labels have documented sensitivity/specificity limitations, these metrics demonstrate correlation with existing records rather than validated identification of true (including undiagnosed) glaucoma cases. This directly affects the pre-screening interpretation and requires explicit discussion or sensitivity analysis on label noise.
- Methods: Exclusion criteria for the 20,636-patient Stanford cohort, the precise definition of glaucoma labels (e.g., specific ICD codes or encounter requirements), and the strategy for handling missing values in systemic EHR inputs are not described. These omissions are load-bearing for assessing selection bias and reproducibility of the reported performance.
- Results: The held-out test set is drawn from the same Stanford institution used for fine-tuning (after All of Us pretraining). While this provides some external grounding, the manuscript should clarify the train/fine-tune/test split proportions and discuss whether performance would hold at other sites without site-specific fine-tuning.
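The first major comment turns on how label error changes the meaning of the reported PPV. The sketch below is a simple arithmetic illustration of that concern, not an analysis from the paper. It takes the reported PPV of 0.657 and applies two hypothetical error rates among flagged patients: the share of label-negative patients who actually have undiagnosed glaucoma, and the share of label-positive patients who are miscoded. Both rates, and the function name `adjusted_ppv`, are assumptions chosen for illustration.

```python
def adjusted_ppv(observed_ppv, undiagnosed_rate, miscoded_rate):
    """True-disease PPV among flagged patients under assumed label error rates.

    undiagnosed_rate: assumed share of flagged, label-negative patients who truly have glaucoma.
    miscoded_rate: assumed share of flagged, label-positive patients who do not.
    """
    for name, value in (("observed_ppv", observed_ppv),
                        ("undiagnosed_rate", undiagnosed_rate),
                        ("miscoded_rate", miscoded_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    true_from_positives = observed_ppv * (1.0 - miscoded_rate)
    true_from_negatives = (1.0 - observed_ppv) * undiagnosed_rate
    return true_from_positives + true_from_negatives


# 0.657 is the reported PPV; the error rates below are hypothetical scenarios.
for u, m in [(0.0, 0.0), (0.05, 0.0), (0.0, 0.10), (0.05, 0.10)]:
    print(f"undiagnosed={u:.2f} miscoded={m:.2f} -> {adjusted_ppv(0.657, u, m):.3f}")
```

Underdiagnosis pushes the true PPV above the observed value, while miscoding pushes it below, so the direction of the bias depends on which error dominates. Estimating either rate would require reference-standard data such as ophthalmic exams.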
minor comments (2)
- The abstract states performance 'improved with more trainable layers up to 15 and with additional data' but does not report the specific layer counts, data volumes, or ablation tables supporting this claim.
- Clarify the size of the held-out test set and the fraction of the Stanford cohort used for fine-tuning versus testing.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which highlight important considerations for interpreting our results and ensuring methodological transparency. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: Abstract and Results: The headline metrics (AUROC 0.883, PPV 0.657, 65.7% diagnosis rate in top decile) are computed against Stanford EHR-recorded glaucoma diagnoses as ground truth. Because glaucoma is known to be underdiagnosed in non-ophthalmic EHR and ICD-based labels have documented sensitivity/specificity limitations, these metrics demonstrate correlation with existing records rather than validated identification of true (including undiagnosed) glaucoma cases. This directly affects the pre-screening interpretation and requires explicit discussion or sensitivity analysis on label noise.
Authors: We agree that the ground truth consists of EHR-recorded glaucoma diagnoses, which are imperfect due to underdiagnosis and the known limitations of ICD codes. Our model is designed to predict recorded glaucoma status from systemic features, providing a practical tool for pre-screening patients who may warrant ophthalmic evaluation. In the revised manuscript, we will add a dedicated paragraph in the Discussion section explicitly addressing label noise, citing literature on glaucoma underdiagnosis rates and ICD sensitivity/specificity. While a quantitative sensitivity analysis on label noise cannot be performed without additional ground-truth data (such as ophthalmic exams), we will discuss its implications for the pre-screening use case and emphasize that the observed calibration in the top decile supports clinical utility even with noisy labels. revision: partial
- Referee: Methods: Exclusion criteria for the 20,636-patient Stanford cohort, the precise definition of glaucoma labels (e.g., specific ICD codes or encounter requirements), and the strategy for handling missing values in systemic EHR inputs are not described. These omissions are load-bearing for assessing selection bias and reproducibility of the reported performance.
Authors: We acknowledge these omissions and will correct them in the revised manuscript. The Methods section will be expanded to detail: the exclusion criteria applied to arrive at the final 20,636-patient cohort; the exact ICD-9 and ICD-10 codes (with any minimum encounter or diagnosis requirements) used to define glaucoma labels; and the handling of missing values, including any imputation techniques, variable exclusion thresholds, or encoding strategies for the systemic EHR inputs. These additions will enable full assessment of selection bias and reproducibility. revision: yes
- Referee: Results: The held-out test set is drawn from the same Stanford institution used for fine-tuning (after All of Us pretraining). While this provides some external grounding, the manuscript should clarify the train/fine-tune/test split proportions and discuss whether performance would hold at other sites without site-specific fine-tuning.
Authors: We will clarify the data partitioning in the revised Methods and Results sections, specifying that the Stanford cohort was split into 70% for fine-tuning, 10% for internal validation during fine-tuning, and 20% for the held-out test set. On generalizability, we agree that site-specific fine-tuning may be required for optimal performance elsewhere. The revised Discussion will explicitly address this limitation, noting that the All of Us pretraining provides a transferable foundation but that multi-institutional validation without additional fine-tuning remains an important direction for future work. This does not alter the core demonstration of effective transfer from national to institutional data. revision: yes
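The 70/10/20 partition named in the simulated response can be sketched as a patient-level split. The example below is illustrative only: the text does not say whether the split was stratified by glaucoma status, so stratification is an assumption here, as are the function name `stratified_split` and the seed. The class counts (3,165 glaucoma and 17,471 non-glaucoma patients out of 20,636) are taken from the paper's cohort characteristics table.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Return (train, val, test) index lists with class balance preserved per split."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(fractions[0] * len(indices))
        n_val = int(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the held-out test set
    return train, val, test


# One label per patient: 3,165 glaucoma and 17,471 non-glaucoma.
labels = [1] * 3165 + [0] * 17471
train, val, test = stratified_split(labels)
for name, part in (("fine-tune", train), ("validation", val), ("test", test)):
    prevalence = sum(labels[i] for i in part) / len(part)
    print(f"{name}: n={len(part)} prevalence={prevalence:.3f}")
```

Splitting by patient rather than by encounter keeps each patient in exactly one partition, which is what makes the test set held out in a meaningful sense.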
Circularity Check
No circularity: performance metrics derived from held-out evaluation on independent data
The paper pretrains a deep learning model on All of Us data, fine-tunes it on a Stanford cohort, and reports AUROC, PPV, and calibration on a held-out Stanford test set, using systemic EHR features as inputs and recorded glaucoma diagnoses as labels. This is standard supervised transfer learning evaluated on held-out patients from the fine-tuning site; the reported metrics are not equivalent to any fitted parameter or input by construction. The provided text contains no equations, self-definitional steps, or load-bearing self-citations that reduce the central claim to prior inputs. The derivation chain consists of empirical training and testing rather than mathematical or definitional reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: EHR-recorded glaucoma diagnoses are accurate and complete proxies for true clinical status