Performance of weakly-supervised electronic health record-based phenotyping methods in rare-outcome settings
Pith reviewed 2026-05-10 16:29 UTC · model grok-4.3
The pith
Weakly-supervised methods for rare medical conditions in electronic health records perform unevenly and depend on tuning and silver label strength.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through simulations that range from simple to complex data-generating processes and include different outcome rates and silver label qualities, the study finds that PheNorm, MAP, and SureLDA do not consistently outperform one another on all accuracy measures. SureLDA often ranks high when silver labels are informative, yet all three methods are sensitive to chosen tuning parameters. The authors conclude that these approaches can be useful in rare-outcome settings when the proxies are strong predictors, but caution is warranted if the resulting probabilities feed into further analyses.
What carries the argument
An extensive simulation study that generates synthetic electronic health record data with varying outcome rarity and silver-label noise, then applies and evaluates three weakly-supervised phenotyping algorithms (PheNorm, MAP, SureLDA) that combine structured features with natural-language-processing outputs.
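To make the shape of such a simulation concrete, here is a minimal sketch under stated assumptions. It is not the paper's data-generating mechanism: the function names `simulate_cohort` and `auc`, the Gaussian noise model, and the `signal` parameter (how far the silver label's mean shifts for true cases) are all hypothetical choices made for illustration. It varies silver-label strength at a fixed rare prevalence and scores the silver label alone as a ranker.

```python
import random

def simulate_cohort(n, prevalence, signal, seed=0):
    """Return (true_outcome, silver_label) pairs from a toy mechanism.

    Assumed, not the paper's DGM: y ~ Bernoulli(prevalence), and the
    silver label is a noisy count-like score whose mean shifts by
    `signal` for true cases.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        silver = max(0.0, rng.gauss(1.0 + signal * y, 1.0))
        rows.append((y, silver))
    return rows

def auc(rows):
    """Rank-based AUC: P(case score > control score), ties count 1/2."""
    cases = [s for y, s in rows if y == 1]
    controls = [s for y, s in rows if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))

# Same seed, so only the silver-label strength differs between runs.
weak = auc(simulate_cohort(5000, prevalence=0.01, signal=0.5))
strong = auc(simulate_cohort(5000, prevalence=0.01, signal=3.0))
print(f"AUC of silver label alone: weak={weak:.3f}, strong={strong:.3f}")
```

A full study would replace the single silver score with several structured and NLP-derived features and feed them to the phenotyping algorithms; the sketch only shows how prevalence and label strength enter as simulation knobs.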
Load-bearing premise
The simulated data patterns and label noise levels match the statistical behavior of real electronic health record data for rare medical events.
What would settle it
A real-world chart review in a rare-outcome electronic health record cohort. If silver labels are only moderately predictive and the predicted probabilities from the three methods fail to enrich for true cases or are poorly calibrated, the claim that the methods work well under those conditions would be falsified.
Original abstract
Accurately identifying patients with specific medical conditions is a key challenge when using clinical data from electronic health records. Our objective was to comprehensively assess when weakly-supervised prediction methods, which use silver-standard labels (proxy measures of the true outcome) rather than gold-standard true labels, perform well in rare-outcome settings like vaccine safety studies. We compared three methods (PheNorm, MAP, and sureLDA) that combine structured features and features derived from clinical text using natural language processing, through an extensive simulation study with data-generating mechanisms ranging from simple to complex, varying outcome rates, and varying degrees of informative silver labels. We also considered using predicted probabilities to design a chart review validation study. No single method dominated the others across all prediction performance metrics. Probability-guided sampling selected a cohort enriched for patients with more mentions of important concepts in chart notes. SureLDA, the most complex of the three algorithms we considered, often performed well in simulations. Performance depended greatly on selected tuning parameters. Care should be taken when using weakly-supervised prediction methods in rare-outcome settings, particularly if the probabilities will be used in downstream analysis, but these methods can work well when silver labels are strong predictors of true outcomes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports an extensive simulation study comparing three weakly-supervised EHR phenotyping methods (PheNorm, MAP, and SureLDA) for rare outcomes. Simulations vary outcome prevalence, silver-label informativeness, and data-generating mechanism complexity (simple to complex). Performance is assessed across multiple metrics; the authors also examine using predicted probabilities for probability-guided chart-review sampling. Main conclusions: no method dominates all metrics; SureLDA often performs well; results are highly sensitive to tuning parameters; the methods can succeed when silver labels are strong predictors, but caution is warranted for rare-outcome settings and downstream analyses.
Significance. If the simulation results are representative, the work supplies practical comparative guidance for weakly-supervised phenotyping in low-prevalence settings such as vaccine safety studies. The systematic variation of prevalence, label strength, and complexity is a clear strength that supports the claim of no universal winner and the conditional recommendation. The additional exploration of probability-guided sampling for validation is a useful extension. The study is reproducible in principle via its simulation framework, though the absence of real-data validation limits direct translation to practice.
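The idea of probability-guided sampling for chart review can be sketched as follows. The paper's actual sampling design is not given here, so this is only one plausible reading: weighted sampling without replacement, with each patient's weight equal to the predicted probability. The function name `probability_guided_sample` and the toy cohort are hypothetical; the key-based selection is the Efraimidis-Spirakis method, chosen for brevity.

```python
import random

def probability_guided_sample(p_hat, k, seed=0):
    """Pick k distinct indices, favoring higher predicted probabilities.

    Uses Efraimidis-Spirakis keys u ** (1 / w) with weight w = p_hat[i];
    this weighting is an illustrative choice, not the paper's design.
    """
    if k > len(p_hat):
        raise ValueError("cannot sample more patients than the cohort holds")
    rng = random.Random(seed)
    keyed = []
    for i, w in enumerate(p_hat):
        if w <= 0:
            raise ValueError("weights must be positive")
        keyed.append((rng.random() ** (1.0 / w), i))
    keyed.sort(reverse=True)          # the largest keys are selected
    return [i for _, i in keyed[:k]]

# Toy cohort: 20 likely cases among 1000 patients (2% flagged as likely).
p_hat = [0.9] * 20 + [0.01] * 980
chosen = probability_guided_sample(p_hat, k=50)
n_high = sum(1 for i in chosen if p_hat[i] > 0.5)
print(f"{n_high} of {len(chosen)} sampled charts have high predicted probability")
```

A simple random sample of 50 charts from this cohort would contain about one likely case on average, so the weighted draw illustrates the enrichment the referee describes. Any downstream estimate from such a sample would need to account for the unequal inclusion probabilities.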
major comments (1)
- Simulation design (throughout §3 and §4): The data-generating mechanisms vary prevalence and silver-label strength but do not incorporate key real-EHR features such as differential missingness by outcome rarity or correlated noise between structured codes and NLP mentions. Because the central practical claim ('these methods can work well when silver labels are strong predictors of true outcomes') rests on the simulations reflecting actual label-noise structures, this omission is load-bearing for the conditional recommendation in the Abstract and Discussion.
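One of the omitted features, correlated noise between structured codes and NLP mentions, could be added to a simulation along the following lines. This is a hedged sketch of what such an extension might look like, not something the paper implements: the shared-noise construction, the function names, and the parameters `signal` and `rho` are assumptions. Differential missingness is not covered here.

```python
import math
import random

def simulate_correlated_silver(n, prevalence, signal, rho, seed=0):
    """Return (y, code_score, nlp_score) rows whose noise has correlation rho.

    Hypothetical mechanism: both silver features share one Gaussian
    noise term, so their errors co-vary even among non-cases.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        shared = rng.gauss(0.0, 1.0)
        code = signal * y + math.sqrt(rho) * shared \
            + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        nlp = signal * y + math.sqrt(rho) * shared \
            + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        rows.append((y, code, nlp))
    return rows

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (v - my) for x, v in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((v - my) ** 2 for v in ys)
    return cov / math.sqrt(vx * vy)

def noise_correlation(rows):
    """Correlation of the two silver features among non-cases only."""
    controls = [(c, t) for y, c, t in rows if y == 0]
    return pearson([c for c, _ in controls], [t for _, t in controls])

independent = noise_correlation(simulate_correlated_silver(4000, 0.01, 2.0, rho=0.0))
correlated = noise_correlation(simulate_correlated_silver(4000, 0.01, 2.0, rho=0.8))
print(f"noise correlation among non-cases: {independent:.2f} vs {correlated:.2f}")
```

With `rho` near zero the two silver features err independently; as `rho` grows, a non-case with a spurious code is also more likely to have spurious text mentions, which is the kind of dependence the comment asks about.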
minor comments (3)
- Abstract: inconsistent capitalization of the algorithm name ('sureLDA' vs. 'SureLDA').
- Methods section: the precise definitions of the performance metrics (e.g., how AUC, F1, and calibration are computed under rare-event imbalance) should be stated explicitly rather than referenced only to prior work.
- Results: tables reporting performance across tuning-parameter grids would benefit from clearer indication of which parameter combinations were selected as 'default' versus 'optimized'.
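The request for explicit metric definitions can be illustrated with a small worked example. The definitions below are common textbook ones and may differ from those the paper uses; the function names, the 0.5 threshold, and the ten-patient toy data are assumptions made for illustration.

```python
def f1_score(y_true, p_hat, threshold):
    """F1 = 2TP / (2TP + FP + FN) after thresholding the probabilities."""
    pairs = list(zip(y_true, p_hat))
    tp = sum(1 for y, p in pairs if y == 1 and p >= threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p >= threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def rank_auc(y_true, p_hat):
    """AUC as P(case ranked above control), ties count 1/2."""
    cases = [p for y, p in zip(y_true, p_hat) if y == 1]
    controls = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(1.0 if c > d else 0.5 if c == d else 0.0
               for c in cases for d in controls)
    return wins / (len(cases) * len(controls))

def calibration_in_the_large(y_true, p_hat):
    """Mean predicted probability minus observed outcome rate."""
    return sum(p_hat) / len(p_hat) - sum(y_true) / len(y_true)

# Toy data: 2 cases among 10 patients.
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p = [0.1, 0.1, 0.2, 0.1, 0.3, 0.6, 0.2, 0.1, 0.7, 0.4]

print("F1 at 0.5:", f1_score(y, p, 0.5))
print("AUC:", rank_auc(y, p))
print("calibration-in-the-large:", round(calibration_in_the_large(y, p), 3))
```

On this toy data the AUC is 15/16 while F1 at the 0.5 threshold is only 0.5, and the mean prediction overshoots the observed rate by 0.08. This shows why ranking, thresholded, and calibration metrics can disagree when cases are scarce, and why each needs its own stated definition.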
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review of our simulation study. We address the major comment point by point below, with revisions to the manuscript where appropriate.
Point-by-point responses
- Referee: Simulation design (throughout §3 and §4): The data-generating mechanisms vary prevalence and silver-label strength but do not incorporate key real-EHR features such as differential missingness by outcome rarity or correlated noise between structured codes and NLP mentions. Because the central practical claim ('these methods can work well when silver labels are strong predictors of true outcomes') rests on the simulations reflecting actual label-noise structures, this omission is load-bearing for the conditional recommendation in the Abstract and Discussion.
Authors: We agree that the simulations do not explicitly model differential missingness by outcome rarity or correlated noise between structured codes and NLP-derived features, both of which are plausible in real EHR data. Our data-generating processes were constructed to span simple to complex mechanisms while systematically varying prevalence and silver-label informativeness, but they remain abstractions and do not capture every possible dependence structure. We will revise the Discussion to add an explicit limitations paragraph acknowledging these omissions and noting that the reported performance advantages (particularly for sureLDA) and the sensitivity to tuning and label strength should be interpreted most confidently when silver labels are strong predictors, as directly varied in the simulations. This addition will also qualify the Abstract and Discussion recommendations accordingly. We do not plan to expand the simulation design itself at this stage, as the existing framework already isolates the effects of prevalence and label strength across multiple metrics.
Revision: partial
Circularity Check
No circularity in simulation-based comparative evaluation
full rationale
The paper conducts an empirical simulation study comparing PheNorm, MAP, and SureLDA under varied data-generating mechanisms, outcome prevalences, and silver-label strengths. No derivation chain, first-principles result, or prediction is claimed; performance metrics are computed directly from simulated ground truth. Recommendations follow observed empirical rankings rather than any reduction to fitted inputs or self-citations. External validity of the DGMs is a separate concern, not circularity.
Axiom & Free-Parameter Ledger
free parameters (3)
- outcome prevalence
- silver-label informativeness
- tuning parameters
axioms (1)
- domain assumption: Simulated data-generating processes capture essential statistical features of real EHR phenotyping tasks