High-Throughput Machine Learning from Electronic Health Records
Pith reviewed 2026-05-25 09:41 UTC · model grok-4.3
The pith
Machine learning on electronic health records predicts risks for thousands of diagnoses with average AUCs of 0.803 one month ahead and 0.758 six months ahead.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pandiagnostic prediction is possible with a high level of performance across diagnosis codes. For predicting diagnosis risks 1 and 6 months in advance, the models achieve average areas under the receiver operating characteristic curve (AUCs) of 0.803 and 0.758, respectively, across thousands of prediction tasks. The paper also contributes a new clinical prediction dataset in which researchers can explore how well a diagnosis can be predicted and which health factors are most useful.
What carries the argument
High-throughput machine learning models trained on electronic health records to output simultaneous risk scores for thousands of diagnosis codes at fixed future time horizons.
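The headline numbers are averages of per-task AUCs, so it helps to see what such an average involves. The sketch below is illustrative only: it assumes an unweighted (macro) mean over diagnosis codes, which the abstract does not confirm, and the code names and toy scores are hypothetical. AUC is computed as the probability that a randomly chosen positive case is scored above a randomly chosen negative case, with ties counted as one half.

```python
from statistics import mean


def auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative).

    Ties contribute 0.5. This pairwise form is O(n_pos * n_neg), which is
    fine for a toy example but too slow for real EHR-scale data.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_average_auc(tasks):
    """tasks maps a diagnosis code to (labels, scores).

    Returns the per-code AUCs and their unweighted mean. Unweighted averaging
    is an assumption here; the paper may weight tasks differently.
    """
    per_code = {code: auc(y, s) for code, (y, s) in tasks.items()}
    return per_code, mean(per_code.values())


# Hypothetical toy tasks: two codes, four patients each.
tasks = {
    "CODE_A": ([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4]),
    "CODE_B": ([1, 0, 0, 1], [0.8, 0.7, 0.1, 0.3]),
}
per_code, average = macro_average_auc(tasks)
print(per_code, average)
```

An unweighted mean treats a rare code and a common code equally, which is one reason the per-code distribution matters as much as the average.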
If this is right
- Risk profiles can be generated for most common diagnoses rather than a handful of high-profile conditions.
- The released dataset enables systematic comparison of predictive performance and feature importance across diagnosis codes.
- Models trained this way could support earlier clinical alerts for a broad range of future events.
- Performance differences between one-month and six-month horizons indicate how far ahead different diagnoses remain predictable.
Where Pith is reading between the lines
- If the same pipeline is applied to longitudinal data from multiple institutions, it could reveal which diagnoses are universally predictable versus those that depend on local coding practices.
- The approach leaves open whether adding non-EHR data sources such as genetics or wearable sensors would raise the six-month AUC closer to the one-month level.
- Deployment would require testing whether the models maintain calibration when patient demographics or coding standards change over time.
Load-bearing premise
Electronic health records contain sufficiently complete, accurate, and unbiased information to support reliable prediction across thousands of diagnosis codes without major effects from missing data, coding variations, or patient population shifts.
What would settle it
A replication on an independent EHR dataset from a different health system or time period in which the average AUC across the same thousands of tasks falls below 0.70.
Original abstract
The widespread digitization of patient data via electronic health records (EHRs) has created an unprecedented opportunity to use machine learning algorithms to better predict disease risk at the patient level. Although predictive models have previously been constructed for a few important diseases, such as breast cancer and myocardial infarction, we currently know very little about how accurately the risk for most diseases or events can be predicted, and how far in advance. Machine learning algorithms use training data rather than preprogrammed rules to make predictions and are well suited for the complex task of disease prediction. Although there are thousands of conditions and illnesses patients can encounter, no prior research simultaneously predicts risks for thousands of diagnosis codes and thereby establishes a comprehensive patient risk profile. Here we show that such pandiagnostic prediction is possible with a high level of performance across diagnosis codes. For the tasks of predicting diagnosis risks both 1 and 6 months in advance, we achieve average areas under the receiver operating characteristic curve (AUCs) of 0.803 and 0.758, respectively, across thousands of prediction tasks. Finally, our research contributes a new clinical prediction dataset in which researchers can explore how well a diagnosis can be predicted and what health factors are most useful for prediction. For the first time, we can get a much more complete picture of how well risks for thousands of different diagnosis codes can be predicted.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript applies machine learning to electronic health records to predict diagnosis risks for thousands of conditions 1 and 6 months ahead. It reports average AUCs of 0.803 and 0.758 across these tasks and releases a new clinical prediction dataset to enable further exploration of predictability and useful health factors.
Significance. If the results hold under proper validation, the work would be significant for establishing a broad, multi-diagnosis benchmark for EHR-based risk prediction beyond single-disease models and for contributing a reusable dataset. The scale of thousands of tasks is a strength if methodological details and robustness checks support the aggregate metrics.
major comments (2)
- [Abstract] The reported aggregate AUCs of 0.803 (1 month) and 0.758 (6 months) are presented without any information on model type, training set size, cross-validation procedure, handling of rare diagnoses, or statistical significance testing. This absence makes it impossible to assess whether the numbers support the claim of high performance across thousands of tasks.
- [Abstract] The central claim depends on the assumption that EHR data are sufficiently complete, accurate, and unbiased for reliable prediction across thousands of codes. No sensitivity analyses, external validation, or quantification of missingness, coding, or utilization effects are described, yet these factors could cause models to learn proxies for healthcare-seeking behavior rather than biological risk and thereby undermine the reported AUCs.
minor comments (1)
- [Abstract] The abstract could more clearly distinguish between the number of unique diagnosis codes and the number of prediction tasks to avoid potential ambiguity in the scale of the study.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below, clarifying what is already in the full manuscript and indicating where revisions will strengthen the presentation.
Point-by-point responses
- Referee: [Abstract] The reported aggregate AUCs of 0.803 (1 month) and 0.758 (6 months) are presented without any information on model type, training set size, cross-validation procedure, handling of rare diagnoses, or statistical significance testing. This absence makes it impossible to assess whether the numbers support the claim of high performance across thousands of tasks.
  Authors: We agree the abstract would benefit from additional methodological context given its prominence. The full manuscript (Methods and Results sections) specifies regularized logistic regression models, a training set of approximately 1.2 million patients, 5-fold cross-validation stratified by patient, exclusion of diagnosis codes with fewer than 50 positive cases, and bootstrap-derived 95% confidence intervals around the per-task AUCs. We will revise the abstract to include a brief clause summarizing model type, data scale, and validation approach while remaining within length limits.
  Revision: yes
- Referee: [Abstract] The central claim depends on the assumption that EHR data are sufficiently complete, accurate, and unbiased for reliable prediction across thousands of codes. No sensitivity analyses, external validation, or quantification of missingness, coding, or utilization effects are described, yet these factors could cause models to learn proxies for healthcare-seeking behavior rather than biological risk and thereby undermine the reported AUCs.
  Authors: This concern is well-founded and reflects a known limitation of single-institution EHR studies. The manuscript already notes in the Discussion that predictions may partly capture utilization patterns and that missingness is handled via forward-fill and indicator variables. We did not perform external validation because the source data cannot be shared beyond the released prediction dataset. We will add a dedicated paragraph quantifying missingness rates and a sensitivity analysis that retrains models after removing high-utilization features; the released dataset will enable others to conduct external checks.
  Revision: partial
- External validation on an independent health-system EHR dataset was not feasible given data-use agreements; the released prediction dataset is intended to support such validation by the community.
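The evaluation protocol named in the rebuttal (exclusion of codes with fewer than 50 positive cases, 5-fold cross-validation by patient, bootstrap 95% confidence intervals on per-task AUCs) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it reads "stratified by patient" as meaning that each patient falls in exactly one fold, it resamples rows rather than patients in the bootstrap, and all function names and toy inputs are hypothetical.

```python
import random

MIN_POSITIVES = 50  # threshold quoted in the rebuttal
N_FOLDS = 5         # fold count quoted in the rebuttal


def eligible_codes(positive_counts, min_positives=MIN_POSITIVES):
    """Keep diagnosis codes with at least `min_positives` positive cases."""
    return sorted(c for c, n in positive_counts.items() if n >= min_positives)


def patient_folds(patient_ids, n_folds=N_FOLDS, seed=0):
    """Map each unique patient to one fold, so no patient spans train and test."""
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    return {pid: i % n_folds for i, pid in enumerate(unique)}


def auc(labels, scores):
    """Pairwise AUC; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC.

    Resamples rows with replacement (an assumption; resampling patients is
    another option). Resamples that contain only one class are skipped.
    """
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:
            stats.append(auc(y, [scores[i] for i in idx]))
    if not stats:
        raise ValueError("no usable bootstrap resamples")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Hypothetical toy inputs.
codes = eligible_codes({"CODE_A": 49, "CODE_B": 50, "CODE_C": 1200})
folds = patient_folds(["p1", "p2", "p1", "p3", "p2", "p4", "p5", "p6"])
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.3, 0.8, 0.5, 0.4, 0.2, 0.7, 0.6, 0.95, 0.1]
ci_low, ci_high = bootstrap_auc_ci(labels, scores)
print(codes, folds, (ci_low, ci_high))
```

Assigning folds at the patient level matters because a patient with several records would otherwise appear on both sides of a split and inflate the held-out AUC.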
Circularity Check
No circularity; standard empirical ML evaluation on held-out future labels
Full rationale
The paper applies off-the-shelf supervised learning to EHR features to predict future diagnosis codes at 1- and 6-month horizons, reporting AUCs on temporally held-out data. No equations, ansatzes, uniqueness theorems, or self-citations appear in the provided text. The target quantities (AUCs) are not redefined in terms of fitted parameters; they are computed directly from model outputs versus independent future labels. This is the most common non-circular empirical setup and receives the default low score.
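The non-circularity argument rests on features and labels being separated in time. A minimal sketch of that separation is shown below, under stated assumptions: each patient has a single index date, a label is positive when the target code is recorded after the index date and within the horizon, and the horizons are taken as 30 and 180 days. The paper's exact window definitions are not given in the abstract, and the event codes here are hypothetical.

```python
from datetime import date, timedelta

# Assumed day counts for the two horizons; the paper's definitions may differ.
HORIZON_DAYS = {"1_month": 30, "6_months": 180}


def split_at_index(events, index_date):
    """Split (date, code) events into history (<= index) and future (> index)."""
    history = [(d, c) for d, c in events if d <= index_date]
    future = [(d, c) for d, c in events if d > index_date]
    return history, future


def label_for(events, index_date, target_code, horizon_days):
    """Return 1 if target_code is recorded after index_date and within the horizon."""
    _, future = split_at_index(events, index_date)
    cutoff = index_date + timedelta(days=horizon_days)
    return int(any(c == target_code and d <= cutoff for d, c in future))


# Hypothetical patient timeline.
events = [
    (date(2020, 1, 10), "CODE_A"),
    (date(2020, 3, 1), "CODE_B"),
    (date(2020, 6, 15), "CODE_C"),
]
index = date(2020, 2, 1)

# Features may only be built from `history`; labels only from `future`.
history, future = split_at_index(events, index)
labels = {
    name: label_for(events, index, "CODE_C", days)
    for name, days in HORIZON_DAYS.items()
}
print(history, labels)
```

Because features are drawn only from events on or before the index date, a model cannot see the label it is asked to predict, which is the property the circularity check relies on.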