arxiv: 2604.09913 · v1 · submitted 2026-04-10 · 📊 stat.ME · stat.ML

Performance of weakly-supervised electronic health record-based phenotyping methods in rare-outcome settings

Pith reviewed 2026-05-10 16:29 UTC · model grok-4.3

classification 📊 stat.ME stat.ML
keywords weakly-supervised learning · electronic health records · phenotyping · rare outcomes · simulation study · silver labels · vaccine safety

The pith

Weakly-supervised methods for rare medical conditions in electronic health records perform unevenly and depend on tuning and silver label strength.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests three algorithms that use proxy silver labels instead of costly true labels to flag patients with rare conditions in electronic health record data. It runs simulations that vary how rare the outcome is, how informative the silver labels are, and how complex the underlying data patterns become. A reader would care because many studies, such as vaccine safety monitoring, need to identify small groups of affected patients without reviewing every chart. The results show that performance differs across methods and metrics, with no clear winner, and that the algorithms succeed mainly when the silver labels already track the true outcome closely.

Core claim

Through simulations that range from simple to complex data-generating processes and vary both the outcome rate and the quality of the silver labels, the study finds that none of PheNorm, MAP, and sureLDA consistently outperforms the others across accuracy measures. The sureLDA method often ranks high when silver labels are informative, yet all three methods are sensitive to the chosen tuning parameters. The authors conclude that these approaches can be useful in rare-outcome settings when the proxies are strong predictors, but caution is warranted if the resulting probabilities feed into further analyses.

What carries the argument

An extensive simulation study that generates synthetic electronic health record data with varying outcome rarity and silver-label noise, then applies and evaluates three weakly-supervised phenotyping algorithms (PheNorm, MAP, sureLDA) that combine structured features with natural-language-processing outputs.
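
As a rough illustration of the kind of data-generating process described above, the following sketch draws a rare binary outcome and a count-valued silver label whose mean depends on the true outcome. It is not the authors' simulation design: the function names, the Poisson model, the `base_mean` value, and the `signal` multiplier are all assumptions made for this example.

```python
import math
import random


def draw_poisson(rng, mean):
    """Sample a Poisson count with Knuth's multiplication method (small means only)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_cohort(n, prevalence, signal, seed=0):
    """Return a list of (true_label, silver_count) pairs for n synthetic patients.

    prevalence: P(Y = 1); small values mimic a rare outcome.
    signal: hypothetical multiplier on the mean silver count among true cases;
            signal = 1.0 makes the silver label uninformative.
    """
    rng = random.Random(seed)
    base_mean = 0.5  # assumed mean number of code/NLP mentions among non-cases
    cohort = []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        mean = base_mean * signal if y == 1 else base_mean
        cohort.append((y, draw_poisson(rng, mean)))
    return cohort


def mean_silver_by_label(cohort):
    """Average silver count among non-cases and cases, as a (non_case, case) pair."""
    totals, counts = [0, 0], [0, 0]
    for y, silver in cohort:
        totals[y] += silver
        counts[y] += 1
    return tuple(t / c if c else float("nan") for t, c in zip(totals, counts))


if __name__ == "__main__":
    # Vary silver-label strength at a fixed 1% outcome rate.
    for signal in (1.0, 3.0, 8.0):
        cohort = simulate_cohort(n=20000, prevalence=0.01, signal=signal)
        print(signal, mean_silver_by_label(cohort))
```

Raising `signal` widens the gap between cases and non-cases, which is one simple way to represent a more informative silver label; lowering `prevalence` makes the outcome rarer.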

Load-bearing premise

The simulated data patterns and label noise levels match the statistical behavior of real electronic health record data for rare medical events.

What would settle it

A real-world chart review in a rare-outcome electronic health record cohort where the predicted probabilities from the three methods fail to enrich for true cases or show poor calibration when silver labels are only moderately predictive would falsify the claim that the methods work well under those conditions.

Figures

Figures reproduced from arXiv: 2604.09913 by Brian D. Williamson, Jennifer C. Nelson, Yunjing Hong.

Figure 1: Automated EHR phenotyping workflow with parallel processing. Silver-standard labels from electronic health records feed into PheNorm and MAP methods in parallel. PheNorm uses weakly-supervised denoising with dropout regression and EM estimation. MAP employs multimodal ensemble learning with mixture models. sureLDA integrates both approaches: using PheNorm probabilities as Dirichlet priors and incorporatin…

Figure 2: Comprehensive flowchart illustrating simplified, LDA, and complex data gener…

Figure 3: Algorithm performance measured by AUC across different scenarios and data…

Figure 4: Algorithm performance measured by mean squared error (MSE) across different…
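
The Figure 1 caption mentions mixture models and EM estimation. To make that idea concrete, here is a minimal sketch of a generic two-component Gaussian mixture fitted by EM to a single transformed silver-label feature, with the posterior probability of the higher component read as a phenotype probability. This is a textbook EM routine written for illustration, not the MAP or PheNorm implementation; the function names, the initialisation, and the variance floor are assumptions.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_case(x, params):
    """P(case component | x) under a two-component mixture."""
    pi1, mu0, s0, mu1, s1 = params
    high = pi1 * normal_pdf(x, mu1, s1)
    low = (1.0 - pi1) * normal_pdf(x, mu0, s0)
    total = high + low
    return high / total if total > 0.0 else 0.0


def fit_mixture_em(xs, n_iter=200):
    """Fit a two-component Gaussian mixture to xs by EM.

    Returns (pi1, mu0, s0, mu1, s1); component 1 is initialised at the
    largest observed value and is treated as the "case" component.
    """
    n = len(xs)
    ordered = sorted(xs)
    overall_mean = sum(xs) / n
    overall_sd = max(math.sqrt(sum((x - overall_mean) ** 2 for x in xs) / n), 1e-3)
    # Assumed starting point: a small case fraction and well-separated means.
    params = (0.1, ordered[n // 4], overall_sd, ordered[-1], overall_sd)
    for _ in range(n_iter):
        resp = [posterior_case(x, params) for x in xs]  # E-step
        w1 = sum(resp)
        w0 = n - w1
        if w1 < 1e-9 or w0 < 1e-9:
            break  # one component collapsed; keep the last usable parameters
        # M-step: weighted means and standard deviations (floored at 1e-3).
        mu1 = sum(r * x for r, x in zip(resp, xs)) / w1
        mu0 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / w0
        s1 = max(math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, xs)) / w1), 1e-3)
        s0 = max(math.sqrt(sum((1.0 - r) * (x - mu0) ** 2 for r, x in zip(resp, xs)) / w0), 1e-3)
        params = (w1 / n, mu0, s0, mu1, s1)
    return params


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic feature: 5% of patients come from a shifted "case" distribution.
    xs = [rng.gauss(0.0, 1.0) for _ in range(950)] + [rng.gauss(6.0, 1.0) for _ in range(50)]
    params = fit_mixture_em(xs)
    print(params, posterior_case(5.0, params))
```

When the two components overlap heavily, which is the weak-silver-label case, the fitted posterior probabilities become far less decisive, and the result can depend on the starting values.
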
Original abstract

Accurately identifying patients with specific medical conditions is a key challenge when using clinical data from electronic health records. Our objective was to comprehensively assess when weakly-supervised prediction methods, which use silver-standard labels (proxy measures of the true outcome) rather than gold-standard true labels, perform well in rare-outcome settings like vaccine safety studies. We compared three methods (PheNorm, MAP, and sureLDA) that combine structured features and features derived from clinical text using natural language processing, through an extensive simulation study with data-generating mechanisms ranging from simple to complex, varying outcome rates, and varying degrees of informative silver labels. We also considered using predicted probabilities to design a chart review validation study. No single method dominated the other across all prediction performance metrics. Probability-guided sampling selected a cohort enriched for patients with more mentions of important concepts in chart notes. SureLDA, the most complex of the three algorithms we considered, often performed well in simulations. Performance depended greatly on selected tuning parameters. Care should be taken when using weakly-supervised prediction methods in rare-outcome settings, particularly if the probabilities will be used in downstream analysis, but these methods can work well when silver labels are strong predictors of true outcomes.
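
The abstract refers to probability-guided sampling for a chart review validation study but, as quoted here, does not spell out the sampling scheme. One plausible reading is weighted sampling without replacement, with each patient's weight equal to the predicted probability. The sketch below implements that reading with Efraimidis-Spirakis random keys; the function name and the choice of scheme are assumptions, not the authors' design.

```python
import random


def probability_guided_sample(probs, k, seed=0):
    """Pick k distinct patient indices, favouring high predicted probabilities.

    Uses Efraimidis-Spirakis keys u ** (1 / w): the k largest keys form a
    weighted sample without replacement, with weight w = predicted probability.
    """
    if k > len(probs):
        raise ValueError("k cannot exceed the cohort size")
    rng = random.Random(seed)
    keyed = []
    for index, w in enumerate(probs):
        # Zero-probability patients get the lowest key so they are chosen last.
        key = rng.random() ** (1.0 / w) if w > 0.0 else -1.0
        keyed.append((key, index))
    keyed.sort(reverse=True)
    return [index for _, index in keyed[:k]]


if __name__ == "__main__":
    # Hypothetical cohort: most patients have low predicted probability.
    probs = [0.01] * 950 + [0.8] * 50
    chosen = probability_guided_sample(probs, k=50)
    print(sum(probs[i] for i in chosen) / len(chosen))
```

Compared with simple random sampling, a review set drawn this way should contain far more likely cases, which matters when the outcome is rare. Estimates computed from such a sample would need to account for the unequal selection probabilities.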

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper reports an extensive simulation study comparing three weakly-supervised EHR phenotyping methods (PheNorm, MAP, and sureLDA) for rare outcomes. Simulations vary outcome prevalence, silver-label informativeness, and data-generating mechanism complexity (simple to complex). Performance is assessed across multiple metrics; the authors also examine using predicted probabilities for probability-guided chart-review sampling. Main conclusions: no method dominates all metrics; sureLDA often performs well; results are highly sensitive to tuning parameters; the methods can succeed when silver labels are strong predictors, but caution is warranted for rare-outcome settings and downstream analyses.

Significance. If the simulation results are representative, the work supplies practical comparative guidance for weakly-supervised phenotyping in low-prevalence settings such as vaccine safety studies. The systematic variation of prevalence, label strength, and complexity is a clear strength that supports the claim of no universal winner and the conditional recommendation. The additional exploration of probability-guided sampling for validation is a useful extension. The study is reproducible in principle via its simulation framework, though the absence of real-data validation limits direct translation to practice.

major comments (1)
  1. Simulation design (throughout §3 and §4): The data-generating mechanisms vary prevalence and silver-label strength but do not incorporate key real-EHR features such as differential missingness by outcome rarity or correlated noise between structured codes and NLP mentions. Because the central practical claim ('these methods can work well when silver labels are strong predictors of true outcomes') rests on the simulations reflecting actual label-noise structures, this omission is load-bearing for the conditional recommendation in the Abstract and Discussion.
minor comments (3)
  1. Abstract: inconsistent capitalization of the algorithm name ('sureLDA' vs. 'SureLDA').
  2. Methods section: the precise definitions of the performance metrics (e.g., how AUC, F1, and calibration are computed under rare-event imbalance) should be stated explicitly rather than referenced only to prior work.
  3. Results: tables reporting performance across tuning-parameter grids would benefit from clearer indication of which parameter combinations were selected as 'default' versus 'optimized'.
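
To illustrate the explicit metric definitions requested in minor comment 2, the sketch below gives one conventional way to compute AUC (rank-based, with midranks for ties), mean squared error, and F1 at a fixed threshold. The paper's own definitions are not reproduced on this page, so these are assumptions: in particular, MSE is taken here against the true binary label, and the 0.5 threshold for F1 is arbitrary.

```python
def auc(y_true, scores):
    """Area under the ROC curve via the Mann-Whitney rank-sum identity."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one case and one non-case")
    pairs = sorted(zip(scores, y_true))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        rank_sum_pos += midrank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def mean_squared_error(y_true, scores):
    """Mean squared difference between predicted probability and true label."""
    return sum((s - y) ** 2 for y, s in zip(y_true, scores)) / len(y_true)


def f1_at_threshold(y_true, scores, threshold=0.5):
    """F1 score after classifying score >= threshold as a case."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    p = [0.1, 0.4, 0.35, 0.8]
    print(auc(y, p), mean_squared_error(y, p), f1_at_threshold(y, p))
```

With a rare outcome, these metrics can disagree: AUC depends only on ranking, whereas MSE is dominated by the many non-cases and F1 depends strongly on the chosen threshold.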

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed review of our simulation study. We address the major comment point by point below, with revisions to the manuscript where appropriate.

  1. Referee: Simulation design (throughout §3 and §4): The data-generating mechanisms vary prevalence and silver-label strength but do not incorporate key real-EHR features such as differential missingness by outcome rarity or correlated noise between structured codes and NLP mentions. Because the central practical claim ('these methods can work well when silver labels are strong predictors of true outcomes') rests on the simulations reflecting actual label-noise structures, this omission is load-bearing for the conditional recommendation in the Abstract and Discussion.

    Authors: We agree that the simulations do not explicitly model differential missingness by outcome rarity or correlated noise between structured codes and NLP-derived features, both of which are plausible in real EHR data. Our data-generating processes were constructed to span simple to complex mechanisms while systematically varying prevalence and silver-label informativeness, but they remain abstractions and do not capture every possible dependence structure. We will revise the Discussion to add an explicit limitations paragraph acknowledging these omissions and noting that the reported performance advantages (particularly for sureLDA) and the sensitivity to tuning and label strength should be interpreted most confidently when silver labels are strong predictors, as directly varied in the simulations. This addition will also qualify the Abstract and Discussion recommendations accordingly. We do not plan to expand the simulation design itself at this stage, as the existing framework already isolates the effects of prevalence and label strength across multiple metrics.

    Revision: partial

Circularity Check

0 steps flagged

No circularity in simulation-based comparative evaluation

The paper conducts an empirical simulation study comparing PheNorm, MAP, and sureLDA under varied data-generating mechanisms, outcome prevalences, and silver-label strengths. No derivation chain, first-principles result, or prediction is claimed; performance metrics are computed directly from simulated ground truth. Recommendations follow observed empirical rankings rather than any reduction to fitted inputs or self-citations. External validity of the data-generating mechanisms is a separate concern, not circularity.

Axiom & Free-Parameter Ledger

3 free parameters · 1 axiom · 0 invented entities

Central claims rest on the fidelity of the simulation design to real EHR data and on standard statistical assumptions for generating binary outcomes and noisy silver labels; no new entities are postulated.

free parameters (3)
  • outcome prevalence
    Varied across simulations to represent rare-outcome settings; directly affects performance metrics.
  • silver-label informativeness
    Varied to test dependence on proxy quality; central to the comparative results.
  • tuning parameters
    Method-specific parameters whose choice strongly influences reported performance.
axioms (1)
  • domain assumption: Simulated data-generating processes capture essential statistical features of real EHR phenotyping tasks
    Invoked to generalize simulation results to practice; stated in the objective and discussion of limitations.

pith-pipeline@v0.9.0 · 5519 in / 1440 out tokens · 43349 ms · 2026-05-10T16:29:50.217037+00:00 · methodology

