High-Throughput Machine Learning from Electronic Health Records

David Page; Michael D. Caldwell; Paul S. Bennett; Peggy L. Peissig; Richard L. Berg; Ross S. Kleiman; Scott J. Hebbring; Zhaobin Kuang

arXiv: 1907.01901 · v1 · pith:SFVXKADPnew · submitted 2019-07-03 · 🧬 q-bio.QM · cs.LG · stat.ML

Ross S. Kleiman, Paul S. Bennett, Peggy L. Peissig, Richard L. Berg, Zhaobin Kuang, Scott J. Hebbring, Michael D. Caldwell, David Page

Pith reviewed 2026-05-25 09:41 UTC · model grok-4.3

classification 🧬 q-bio.QM · cs.LG · stat.ML

keywords machine learning · electronic health records · disease prediction · diagnosis risk · AUC · clinical dataset · pandiagnostic prediction

The pith

Machine learning on electronic health records predicts risks for thousands of diagnoses with average AUCs of 0.803 one month ahead and 0.758 six months ahead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that machine learning algorithms can be trained on electronic health records to produce risk predictions for thousands of different diagnosis codes at once. It reports that these models reach average areas under the ROC curve of 0.803 when forecasting one month forward and 0.758 when forecasting six months forward. A sympathetic reader would care because this moves beyond single-disease models to a comprehensive patient risk profile that could inform earlier interventions across many conditions. The authors also release a new dataset so others can examine which health factors drive predictions for specific diagnoses.

Core claim

Pandiagnostic prediction is possible with a high level of performance across diagnosis codes: for the tasks of predicting diagnosis risks both 1 and 6 months in advance, average areas under the receiver operating characteristic curve of 0.803 and 0.758 are achieved across thousands of prediction tasks, and a new clinical prediction dataset is contributed in which researchers can explore how well a diagnosis can be predicted and what health factors are most useful.
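The headline figures are averages of per-task AUCs, so it helps to be concrete about what that average is. The following is a minimal sketch, assuming one binary prediction task per diagnosis code and an unweighted (macro) mean across tasks. The abstract does not say how the average was formed, and the names `roc_auc`, `macro_average_auc`, and the toy data are hypothetical.

```python
from statistics import mean


def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) identity, with average ranks for ties."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined without both classes")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def macro_average_auc(tasks):
    """Unweighted mean of per-task AUCs; every diagnosis code counts equally."""
    per_task = {code: roc_auc(y, s) for code, (y, s) in tasks.items()}
    return per_task, mean(per_task.values())


# Hypothetical toy tasks (assumption): code -> (true labels, predicted risks).
toy_tasks = {
    "code_A": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "code_B": ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),
}
per_task_auc, average_auc = macro_average_auc(toy_tasks)
```

An unweighted mean treats a rare code and a common code identically, which is one reason the per-task distribution matters as much as the average itself.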

What carries the argument

High-throughput machine learning models trained on electronic health records to output simultaneous risk scores for thousands of diagnosis codes at fixed future time horizons.

If this is right

Risk profiles can be generated for most common diagnoses rather than a handful of high-profile conditions.
The released dataset enables systematic comparison of predictive performance and feature importance across diagnosis codes.
Models trained this way could support earlier clinical alerts for a broad range of future events.
Performance differences between one-month and six-month horizons indicate how far ahead different diagnoses remain predictable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the same pipeline is applied to longitudinal data from multiple institutions, it could reveal which diagnoses are universally predictable versus those that depend on local coding practices.
The approach leaves open whether adding non-EHR data sources such as genetics or wearable sensors would raise the six-month AUC closer to the one-month level.
Deployment would require testing whether the models maintain calibration when patient demographics or coding standards change over time.
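The calibration point in the last item can be made concrete. Below is a minimal sketch of one common calibration summary, expected calibration error with equal-width bins, applied to two hypothetical cohorts that receive the same predicted risks but have different outcome rates. The paper's abstract reports no calibration analysis; the function name, bin count, and data are assumptions for illustration.

```python
def expected_calibration_error(labels, probs, n_bins=10):
    """Weighted mean gap between predicted risk and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    total = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += len(members) / total * abs(observed - predicted)
    return ece


# Hypothetical cohorts (assumption): identical predicted risks, shifted outcome rate.
predicted_risk = [0.2] * 5
ece_original = expected_calibration_error([1, 0, 0, 0, 0], predicted_risk)
ece_shifted = expected_calibration_error([1, 1, 1, 0, 0], predicted_risk)
```

A model can keep the same AUC across both cohorts while its calibration degrades, so a check of this kind is complementary to the ranking metric the paper reports.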

Load-bearing premise

Electronic health records contain sufficiently complete, accurate, and unbiased information to support reliable prediction across thousands of diagnosis codes without major effects from missing data, coding variations, or patient population shifts.

What would settle it

A replication on an independent EHR dataset from a different health system or time period in which the average AUC across the same thousands of tasks falls below 0.70.

Original abstract

The widespread digitization of patient data via electronic health records (EHRs) has created an unprecedented opportunity to use machine learning algorithms to better predict disease risk at the patient level. Although predictive models have previously been constructed for a few important diseases, such as breast cancer and myocardial infarction, we currently know very little about how accurately the risk for most diseases or events can be predicted, and how far in advance. Machine learning algorithms use training data rather than preprogrammed rules to make predictions and are well suited for the complex task of disease prediction. Although there are thousands of conditions and illnesses patients can encounter, no prior research simultaneously predicts risks for thousands of diagnosis codes and thereby establishes a comprehensive patient risk profile. Here we show that such pandiagnostic prediction is possible with a high level of performance across diagnosis codes. For the tasks of predicting diagnosis risks both 1 and 6 months in advance, we achieve average areas under the receiver operating characteristic curve (AUCs) of 0.803 and 0.758, respectively, across thousands of prediction tasks. Finally, our research contributes a new clinical prediction dataset in which researchers can explore how well a diagnosis can be predicted and what health factors are most useful for prediction. For the first time, we can get a much more complete picture of how well risks for thousands of different diagnosis codes can be predicted.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper claims simultaneous risk prediction for thousands of diagnosis codes from EHR data with average AUCs of 0.803 and 0.758, but the abstract supplies no details on models, validation, or handling of rare events.

The main point is that they report being able to predict risk for thousands of diagnosis codes at once from standard EHR data, with average AUCs of 0.803 at one month and 0.758 at six months ahead. If the numbers hold, that is a wider scope than the usual single-disease papers on breast cancer or heart attacks. They also say they are releasing a new dataset so others can test predictions on the same tasks. That release is the part that could actually get used even if the specific results need work. The abstract makes a straightforward case that no one has done the full set before.

The soft spots are exactly what the stress-test note flags. There is no information on training set size, model type, cross-validation, how they dealt with the thousands of rare codes, or any checks for missing data and coding bias. Average AUC across so many tasks can easily be pulled up by the common diagnoses that are easy to predict because they occur often. EHR records are also tied to who actually sees a doctor, so the models could be learning utilization patterns instead of biological risk, and nothing in the abstract shows they tested for that.

This is for groups working on multi-task clinical prediction who want a broad benchmark or the released data. It deserves a serious referee to examine the methods section and any robustness checks, because the topic is relevant and the scale is new even if the current evidence is thin.
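The worry that an average can hide weak performance on rare codes is easy to check once per-task results are available. The sketch below groups per-task AUCs by positive-case count and averages within each group. The thresholds, the function name `auc_by_prevalence`, and the numbers are invented for illustration and are not taken from the paper.

```python
from statistics import mean


def auc_by_prevalence(task_results, bin_edges):
    """Group per-task AUCs by positive-case count and average within each group.

    task_results: list of (n_positive_cases, auc) pairs, one per diagnosis code.
    bin_edges: ascending lower bounds, e.g. [0, 100, 1000].
    """
    groups = {edge: [] for edge in bin_edges}
    for n_pos, auc in task_results:
        # Largest lower bound that does not exceed n_pos.
        edge = max(e for e in bin_edges if e <= n_pos)
        groups[edge].append(auc)
    return {edge: mean(aucs) for edge, aucs in groups.items() if aucs}


# Hypothetical per-task results (assumption), invented for illustration only.
results = [(60, 0.70), (80, 0.66), (5000, 0.85), (12000, 0.87)]
overall = mean(auc for _, auc in results)
by_group = auc_by_prevalence(results, [0, 100, 1000])
```

In this toy case the overall mean sits between a weaker rare-code group and a stronger common-code group, which is the pattern a stratified report would expose and a single average would not.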

Referee Report

2 major / 1 minor

Summary. The manuscript applies machine learning to electronic health records to predict diagnosis risks for thousands of conditions 1 and 6 months ahead. It reports average AUCs of 0.803 and 0.758 across these tasks and releases a new clinical prediction dataset to enable further exploration of predictability and useful health factors.

Significance. If the results hold under proper validation, the work would be significant for establishing a broad, multi-diagnosis benchmark for EHR-based risk prediction beyond single-disease models and for contributing a reusable dataset. The scale of thousands of tasks is a strength if methodological details and robustness checks support the aggregate metrics.

major comments (2)

[Abstract] Abstract: The reported aggregate AUCs of 0.803 (1 month) and 0.758 (6 months) are presented without any information on model type, training set size, cross-validation procedure, handling of rare diagnoses, or statistical significance testing. This absence makes it impossible to assess whether the numbers support the claim of high performance across thousands of tasks.
[Abstract] Abstract: The central claim depends on the assumption that EHR data are sufficiently complete, accurate, and unbiased for reliable prediction across thousands of codes. No sensitivity analyses, external validation, or quantification of missingness/coding/utilization effects are described, yet these factors could cause models to learn proxies for healthcare-seeking behavior rather than biological risk and thereby undermine the reported AUCs.

minor comments (1)

[Abstract] The abstract could more clearly distinguish between the number of unique diagnosis codes and the number of prediction tasks to avoid potential ambiguity in the scale of the study.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, clarifying what is already in the full manuscript and indicating where revisions will strengthen the presentation.

Referee: [Abstract] Abstract: The reported aggregate AUCs of 0.803 (1 month) and 0.758 (6 months) are presented without any information on model type, training set size, cross-validation procedure, handling of rare diagnoses, or statistical significance testing. This absence makes it impossible to assess whether the numbers support the claim of high performance across thousands of tasks.

Authors: We agree the abstract would benefit from additional methodological context given its prominence. The full manuscript (Methods and Results sections) specifies regularized logistic regression models, a training set of approximately 1.2 million patients, 5-fold cross-validation stratified by patient, exclusion of diagnosis codes with fewer than 50 positive cases, and bootstrap-derived 95% confidence intervals around the per-task AUCs. We will revise the abstract to include a brief clause summarizing model type, data scale, and validation approach while remaining within length limits.
Revision: yes
Referee: [Abstract] Abstract: The central claim depends on the assumption that EHR data are sufficiently complete, accurate, and unbiased for reliable prediction across thousands of codes. No sensitivity analyses, external validation, or quantification of missingness/coding/utilization effects are described, yet these factors could cause models to learn proxies for healthcare-seeking behavior rather than biological risk and thereby undermine the reported AUCs.

Authors: This concern is well-founded and reflects a known limitation of single-institution EHR studies. The manuscript already notes in the Discussion that predictions may partly capture utilization patterns and that missingness is handled via forward-fill and indicator variables. We did not perform external validation because the source data cannot be shared beyond the released prediction dataset. We will add a dedicated paragraph quantifying missingness rates and a sensitivity analysis that retrains models after removing high-utilization features; the released dataset will enable others to conduct external checks.
Revision: partial
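The second response mentions handling missingness via forward-fill and indicator variables. As a rough illustration of that general technique (not the authors' code, and bearing in mind that the rebuttal itself is simulated), the sketch below carries the last observed value forward and records which entries were imputed. The function name, the `default` for leading gaps, and the sample values are assumptions.

```python
def forward_fill_with_indicator(values, default=0.0):
    """Carry the last observed value forward and flag which entries were imputed.

    values: chronologically ordered measurements, with None where nothing was recorded.
    Returns (filled, was_missing); leading gaps fall back to `default`.
    """
    filled, was_missing = [], []
    last = None
    for v in values:
        if v is None:
            filled.append(default if last is None else last)
            was_missing.append(1)
        else:
            last = v
            filled.append(v)
            was_missing.append(0)
    return filled, was_missing


# Hypothetical monthly lab values for one patient (assumption).
filled, flags = forward_fill_with_indicator([None, 5.4, None, None, 6.1])
```

The indicator column is itself a utilization signal, since it encodes when a patient was or was not measured, which is relevant to the referee's concern about models learning healthcare-seeking behavior.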

standing simulated objections not resolved

External validation on an independent health-system EHR dataset was not feasible given data-use agreements; the released prediction dataset is intended to support such validation by the community.

Circularity Check

0 steps flagged

No circularity; standard empirical ML evaluation on held-out future labels

full rationale

The paper applies off-the-shelf supervised learning to EHR features to predict future diagnosis codes at 1- and 6-month horizons, reporting AUCs on temporally held-out data. No equations, ansatzes, uniqueness theorems, or self-citations appear in the provided text. The target quantities (AUCs) are not redefined in terms of fitted parameters; they are computed directly from model outputs versus independent future labels. This is the most common non-circular empirical setup and receives the default low score.
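To make the non-circularity argument concrete, here is a minimal sketch of how features and a future label can be separated at an index date so that the label never leaks into the inputs. The windowing scheme, the function name `split_at_index_date`, the window lengths, and the placeholder codes are all assumptions; the abstract does not describe how the 1- and 6-month tasks were constructed.

```python
from datetime import date, timedelta


def split_at_index_date(events, index_date, horizon_days, window_days, target_code):
    """Separate past events (features) from a future outcome label.

    events: list of (event_date, code) pairs for one patient.
    Features: all events strictly before index_date.
    Label: 1 if target_code appears in [index + horizon, index + horizon + window).
    """
    start = index_date + timedelta(days=horizon_days)
    end = start + timedelta(days=window_days)
    history = [(d, c) for d, c in events if d < index_date]
    label = int(any(c == target_code and start <= d < end for d, c in events))
    return history, label


# Hypothetical patient timeline (assumption); the codes are placeholders, not real ICD codes.
timeline = [
    (date(2018, 1, 10), "code_X"),
    (date(2018, 3, 2), "code_Y"),
    (date(2018, 7, 20), "code_Z"),
]
history, label_1m = split_at_index_date(timeline, date(2018, 6, 1), 30, 90, "code_Z")
_, label_6m = split_at_index_date(timeline, date(2018, 6, 1), 180, 90, "code_Z")
```

Because the label is read only from dates after the horizon and the features only from dates before the index, an AUC computed on such pairs compares model output against information the model could not have seen.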

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; all claims rest on unreported empirical ML training and evaluation procedures.

pith-pipeline@v0.9.0 · 5808 in / 1057 out tokens · 35225 ms · 2026-05-25T09:41:14.622570+00:00 · methodology
