Prediction-based Inference in Electronic Health Record (EHR)-linked Biobanks with Clinically Informative Outcomes

Bhramar Mukherjee; Cheng-Han Yang; Xingran Chen; Zhenke Wu

arxiv: 2603.14356 · v2 · submitted 2026-03-15 · 📊 stat.AP


Pith reviewed 2026-05-15 11:40 UTC · model grok-4.3

keywords: missing data, electronic health records, prediction-based inference, genome-wide association studies, imputation, biobanks, clinically informative missingness, All of Us

The pith

Prediction-based imputation methods improve power for genetic association studies in EHR biobanks when the missingness process is correctly specified or satisfies key independence conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates prediction-based methods that use machine learning predictions to impute missing biomarker values in electronic health record-linked biobank data. It establishes that these methods increase statistical power and efficiency for genome-wide association studies compared with complete-case analysis when the missing data mechanism is modeled accurately. When the model is misspecified, the improvements still hold only if genetic covariates are conditionally independent of missingness and imputation errors are independent of missingness. The work applies the methods to All of Us data for six laboratory biomarkers and shows they can recover known genetic associations with greater precision than weighted complete-case approaches. This matters because clinically driven test ordering creates informative missingness that standard methods cannot handle efficiently.

Core claim

Prediction-based (PB) inference methods, which impute missing continuous or binary biomarker outcomes using external machine learning predictions and then conduct association analyses while accounting for imputation uncertainty, substantially improve statistical power and estimation efficiency relative to complete-case analysis when the missing-data mechanism is correctly specified. Under misspecification, these gains require conditional independence between the covariates of interest and the missingness mechanism together with independence between imputation error and the missingness mechanism.
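
As a rough illustration of the idea, the sketch below implements one generic PB-style estimator for a linear association: the slope from regressing the predictions on the covariate in the full sample, plus a correction from regressing the prediction error on the covariate among participants whose outcome was observed. This follows the general form of prediction-powered inference and is not necessarily one of the four PB methods the paper evaluates. The simulated data, the names `g`, `z`, `f`, and `r`, all parameter values, and the completely-at-random missingness are assumptions made for illustration only.

```python
import random

def ols_slope(x, y):
    """Slope from a simple linear regression of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def pb_slope(g, f, y, r):
    """PB-style slope: prediction slope on everyone plus a correction
    estimated only from participants whose outcome was observed (r == 1)."""
    obs = [i for i, ri in enumerate(r) if ri == 1]
    slope_pred = ols_slope(g, f)
    slope_corr = ols_slope([g[i] for i in obs], [y[i] - f[i] for i in obs])
    return slope_pred + slope_corr

rng = random.Random(0)
n, beta = 20_000, 0.3  # hypothetical sample size and true effect

# Hypothetical genotype coded 0/1/2 (sum of two Bernoulli(0.3) alleles).
g = [float((rng.random() < 0.3) + (rng.random() < 0.3)) for _ in range(n)]
# Non-genetic predictor that the (assumed) external ML model can see.
z = [rng.gauss(0, 1) for _ in range(n)]
# Biomarker outcome; in real data y is unknown wherever r == 0.
y = [beta * gi + zi + rng.gauss(0, 0.5) for gi, zi in zip(g, z)]
# Hypothetical ML prediction of y, available for everyone.
f = [zi + rng.gauss(0, 0.3) for zi in z]
# Observation indicator: 1 = biomarker measured (assumed MCAR in this sketch).
r = [1 if rng.random() < 0.2 else 0 for _ in range(n)]

obs = [i for i in range(n) if r[i] == 1]
cc_est = ols_slope([g[i] for i in obs], [y[i] for i in obs])  # complete-case
pb_est = pb_slope(g, f, y, r)
print(f"true={beta:.2f}  CCA={cc_est:.3f}  PB={pb_est:.3f}")
```

A single run only shows that both estimators target the same slope in this benign setting; comparing their efficiency would require repeating the simulation over many replicates and adding a variance estimator, which this sketch omits.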

What carries the argument

Prediction-based (PB) inference, which fills in missing biomarker outcomes with machine learning predictions and then adjusts the subsequent association tests for the resulting imputation uncertainty, in some variants through weighting.

If this is right

PB methods outperform complete-case analysis in power for both continuous and binary outcomes when missingness is modeled correctly.
In All of Us data, PB methods replicate established GWAS signals for laboratory biomarkers with higher efficiency than weighted complete-case analysis.
Method performance varies with imputation quality and the specific observation process generating the missing biomarkers.
The comparison covers nine methods: four PB variants and five traditional missing-data techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the independence conditions are routinely met in real EHR settings, PB methods could be used more widely even when perfect missingness models are unavailable.
Diagnostics for checking conditional independence between genetic variants and missingness would help decide when PB methods are safe to apply.
The framework could be extended to test other clinically informative missingness patterns, such as those driven by health-system factors not captured in the data.
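
A very simple version of the diagnostic mentioned above can be sketched as follows: regress the observation indicator on the genotype and inspect the z-statistic of the slope. This is an illustrative marginal check, not a procedure from the paper. A conditional version would add the adjustment covariates to the regression, and no check of this kind can assess the second condition (imputation error versus missingness) for participants whose outcome is never observed. The function name `missingness_z` and both simulated scenarios are hypothetical.

```python
import math
import random

def missingness_z(g, r):
    """z-statistic for the slope of the observation indicator r on genotype g,
    from a linear probability model with intercept and classical standard error."""
    n = len(g)
    mg, mr = sum(g) / n, sum(r) / n
    sxx = sum((a - mg) ** 2 for a in g)
    slope = sum((a - mg) * (b - mr) for a, b in zip(g, r)) / sxx
    intercept = mr - slope * mg
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(g, r))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se

rng = random.Random(2)
n = 20_000
# Hypothetical genotype coded 0/1/2.
g = [float((rng.random() < 0.3) + (rng.random() < 0.3)) for _ in range(n)]

# Scenario A (assumed): observation probability does not depend on genotype.
r_indep = [1 if rng.random() < 0.2 else 0 for _ in range(n)]
# Scenario B (assumed): observation probability rises with genotype.
r_dep = [1 if rng.random() < 0.1 + 0.1 * gi else 0 for gi in g]

z_indep = missingness_z(g, r_indep)
z_dep = missingness_z(g, r_dep)
print(f"z (independent) = {z_indep:.2f}   z (dependent) = {z_dep:.2f}")
```

In a GWAS this check would be run per variant, so any threshold on the z-statistic would need the same multiple-testing care as the association analysis itself.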

Load-bearing premise

Either the missing-data mechanism is correctly specified, or both independence conditions hold: the covariates of interest are conditionally independent of the missingness mechanism, and the imputation error is independent of the missingness mechanism.
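
In symbols, one way to write the two independence conditions is sketched below. The notation is an assumption introduced here and is not taken from the paper: $Y$ is the biomarker outcome, $\hat{Y}$ its machine learning prediction, $R$ the indicator that $Y$ is observed, $X$ the covariate of interest (for example a genetic variant), and $Z$ a set of adjustment covariates. The choice of conditioning set in the first condition is likewise a guess, since the summary above does not state it.

```latex
\begin{align*}
  &\text{(C1)}\quad R \perp\!\!\!\perp X \mid Z
     && \text{covariate of interest vs. missingness,} \\
  &\text{(C2)}\quad R \perp\!\!\!\perp \bigl(Y - \hat{Y}\bigr)
     && \text{imputation error vs. missingness.}
\end{align*}
```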

What would settle it

A simulation where missingness depends on an unmeasured factor that correlates with the genetic covariate of interest, producing lower power or inflated type I error for PB methods than for complete-case analysis.
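
One hypothetical way to set up a single replicate of such a simulation is sketched below. Everything in it is an assumption chosen for illustration: the unmeasured factor `u`, the logistic observation model, the way the prediction error depends on `u`, and all numeric values. The PB estimator is the same generic prediction-plus-correction form used for illustration elsewhere, not a specific method from the paper. By construction the outcome is independent of the observation indicator given the genotype, so complete-case analysis remains valid, while the prediction error is tied to the factor that drives missingness.

```python
import math
import random

def ols_slope(x, y):
    """Slope from a simple linear regression of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx

rng = random.Random(1)
n, beta = 50_000, 0.3  # hypothetical sample size and true effect
g, y, f, r = [], [], [], []
for _ in range(n):
    gi = float((rng.random() < 0.3) + (rng.random() < 0.3))  # genotype 0/1/2
    u = 0.5 * gi + rng.gauss(0, 1)          # unmeasured factor, correlated with genotype
    z = rng.gauss(0, 1)                     # measured non-genetic predictor
    yi = beta * gi + z + rng.gauss(0, 0.5)  # outcome does not depend on u
    fi = z - 0.5 * u + rng.gauss(0, 0.3)    # prediction whose error depends on u
    p_obs = 1 / (1 + math.exp(-(-2 + 1.5 * u)))  # test ordering driven by u
    g.append(gi)
    y.append(yi)
    f.append(fi)
    r.append(1 if rng.random() < p_obs else 0)

obs = [i for i in range(n) if r[i] == 1]
g_obs = [g[i] for i in obs]

cc_est = ols_slope(g_obs, [y[i] for i in obs])
# Generic PB-style estimate: prediction slope on everyone plus observed-sample correction.
pb_est = ols_slope(g, f) + ols_slope(g_obs, [y[i] - f[i] for i in obs])

# Mean genotype by observation status shows that missingness tracks the genotype.
mean_g_obs = sum(g_obs) / len(g_obs)
mean_g_mis = (sum(g) - sum(g_obs)) / (n - len(g_obs))
print(f"true={beta:.2f}  CCA={cc_est:.3f}  PB={pb_est:.3f}")
print(f"mean genotype: observed={mean_g_obs:.3f}  missing={mean_g_mis:.3f}")
```

Because both independence conditions are violated here, the PB estimate may drift away from the true effect while the complete-case estimate does not; how large the drift is depends on the assumed parameters. Estimating power or type I error (for the latter, with `beta = 0`) would require many replicates and a test statistic, which this sketch leaves out.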

Original abstract

Electronic health record (EHR)-linked biobank data hold tremendous promise for large-scale discoveries via genome-wide association study (GWAS) on diverse phenotypic traits and biomarkers routinely captured in the EHR. However, heterogeneous missingness in biomarkers compromises the validity and efficiency of statistical analyses. Prediction-based (PB) inference methods meet this challenge by using external machine learning (ML) predictions to impute missing biomarker outcomes, thereby improving statistical power and estimation accuracy in association analyses. Yet, their suitability remains unclear when outcomes are subject to clinically informative observation processes, that is, when laboratory tests are ordered based on both measured and unmeasured patient- and health system-level characteristics. In this paper, we review the statistical underpinnings of popular PB methods and then evaluate nine methods, including four PB methods and five traditional missing-data approaches, under an encompassing set of outcome observation processes for continuous and binary outcomes. PB methods can substantially improve statistical power and estimation efficiency when the missing-data mechanism is correctly specified. Under misspecification, however, these gains require both conditional independence between the covariates of interest and the missingness mechanism and independence between imputation error and the missingness mechanism. Using All of Us (AoU) data, we perform GWAS of six laboratory biomarkers and demonstrate that PB methods can replicate known genetic associations while improving efficiency relative to (weighted) complete-case analysis (CCA). Their performance in replicating existing GWAS results in AoU also depends on imputation quality and the underlying missingness mechanism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PB methods can boost power in EHR GWAS when missingness is modeled correctly, but the gains rest on independences that are hard to confirm in real data.

The main thing to know is that this paper shows prediction-based imputation can improve efficiency and power over complete-case analysis in biobank GWAS, but only when the missing-data mechanism is correctly specified or when two specific independence conditions hold under misspecification. The authors lay this out clearly from the theory and then test it in simulations and in All of Us data for six biomarkers. The All of Us results replicate known associations with tighter estimates than weighted complete-case analysis, which is the practical payoff they highlight.

What the paper does well is run a systematic comparison of nine methods across a range of clinically informative observation processes for both continuous and binary outcomes, including new simulation setups that vary the missingness drivers. This gives a concrete sense of when the gains appear and when they do not. The review of the statistical underpinnings is also straightforward and useful for readers who want the conditions spelled out without heavy notation.

The soft spot is that the real-data demonstration does not include direct checks or sensitivity analyses for the conditional independence between covariates and missingness or for the independence between imputation error and missingness. In EHR settings those conditions are unlikely to hold exactly because of unmeasured patient and system factors, and the paper does not show that the observed efficiency gains survive when they are mildly violated. The simulations flag the sensitivity, but the All of Us example leaves open whether the gains come from meeting the conditions or from other features of the data.

This work is for statisticians and analysts who handle missing biomarker data in large biobanks and want practical guidance on when prediction-based approaches are worth the extra modeling effort. A reader focused on GWAS efficiency or missing-data methods will get usable takeaways from the simulations and the real-data replication.

It deserves a serious referee because it addresses a common practical problem with clear comparisons and flags the assumptions that matter. I would recommend sending it for peer review, with the main request being more explicit discussion of, or diagnostics for, the independence conditions in the application section.

Referee Report

2 major / 1 minor

Summary. The manuscript reviews prediction-based (PB) inference methods for handling missing biomarker outcomes in EHR-linked biobank GWAS and evaluates nine methods (four PB, five traditional) via simulations under various outcome observation processes for continuous and binary traits. It claims PB methods substantially improve power and efficiency when the missing-data mechanism is correctly specified; under misspecification, gains require conditional independence between covariates of interest and the missingness mechanism plus independence between imputation error and missingness. Real-data analysis on All of Us replicates known genetic associations for six biomarkers with efficiency gains over (weighted) complete-case analysis, with performance depending on imputation quality and the underlying missingness mechanism.

Significance. If the stated conditions on the missingness mechanism hold, PB methods provide a practical route to leverage external ML predictions for imputing informative missing outcomes, increasing power and efficiency in large-scale genetic studies of EHR traits. The broad simulation framework and AoU replication offer concrete guidance for practitioners, though the unverifiable nature of the independence assumptions in real EHR observation processes limits the strength of the efficiency claims relative to complete-case analysis.

major comments (2)

[Abstract] Abstract: the claim that PB gains under misspecification 'require both conditional independence between the covariates of interest and the missingness mechanism and independence between imputation error and the missingness mechanism' is load-bearing for superiority over CCA, yet the All of Us GWAS demonstration reports efficiency gains without any verification or sensitivity analysis confirming that these gains stem from the independences holding rather than from other data features.
[Simulation study] Simulation study section: the encompassing set of outcome observation processes is described at a high level without explicit data-generating equations, data exclusion rules, or how the two independence conditions are enforced versus violated, preventing verification that reported power gains arise from the claimed mechanism rather than simulation artifacts.

minor comments (1)

[Review of PB methods] Notation for the four PB methods in the review section could be standardized with explicit equations to aid comparison with the five traditional approaches.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and will revise the manuscript to improve transparency and address concerns about unverified assumptions.

Referee: [Abstract] Abstract: the claim that PB gains under misspecification 'require both conditional independence between the covariates of interest and the missingness mechanism and independence between imputation error and the missingness mechanism' is load-bearing for superiority over CCA, yet the All of Us GWAS demonstration reports efficiency gains without any verification or sensitivity analysis confirming that these gains stem from the independences holding rather than from other data features.

Authors: We agree the independence conditions are central to interpreting efficiency gains under misspecification, as demonstrated in our simulations. Direct verification is not feasible in real EHR data due to unmeasured factors. We will add a sensitivity analysis section exploring robustness to potential violations of these assumptions in the All of Us results.
revision: partial

Referee: [Simulation study] Simulation study section: the encompassing set of outcome observation processes is described at a high level without explicit data-generating equations, data exclusion rules, or how the two independence conditions are enforced versus violated, preventing verification that reported power gains arise from the claimed mechanism rather than simulation artifacts.

Authors: We agree that explicit details will strengthen reproducibility. In the revision, we will include the full data-generating equations for each observation process, specify data exclusion rules, and detail how the conditional independence and imputation error independence conditions are enforced or violated.
revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on external simulations, data benchmarks, and standard missing-data theory

The paper reviews existing PB methods and evaluates nine approaches (four PB, five traditional) via simulation under varied outcome observation processes and via All of Us GWAS application. Performance gains are conditioned on correct missingness specification or on two explicit independence assumptions under misspecification; these are stated as requirements, not derived from the paper's own fitted parameters or equations. No derivation step equates a reported gain to a quantity defined by the same fitted inputs (e.g., no prediction of IE_1 ratios from parameters fitted to IE_1 data). Comparisons to complete-case analysis are external benchmarks. No self-citation chain is load-bearing for the central claims, and no ansatz or uniqueness result is smuggled in. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard missing-data theory plus the two independence conditions stated for the misspecified case; no new entities are introduced.

axioms (1)

domain assumption: The missing-data mechanism can be modeled or approximated using external machine-learning predictions.
Invoked throughout the description of prediction-based methods and their performance guarantees.

pith-pipeline@v0.9.0 · 5577 in / 1187 out tokens · 44726 ms · 2026-05-15T11:40:55.024180+00:00 · methodology
