Prediction-based Inference in Electronic Health Record (EHR)-linked Biobanks with Clinically Informative Outcomes
Pith reviewed 2026-05-15 11:40 UTC · model grok-4.3
The pith
Prediction-based imputation methods improve power for genetic association studies in EHR biobanks when the missingness process is correctly specified or satisfies key independence conditions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prediction-based (PB) inference methods, which impute missing continuous or binary biomarker outcomes using external machine learning predictions and then conduct association analyses while accounting for imputation uncertainty, substantially improve statistical power and estimation efficiency relative to complete-case analysis when the missing-data mechanism is correctly specified. Under misspecification, these gains require conditional independence between the covariates of interest and the missingness mechanism together with independence between imputation error and the missingness mechanism.
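To make the idea concrete, here is a minimal sketch of one well-known style of prediction-based estimator: fit the association model on ML predictions, then subtract a "rectifier" estimated from the samples where the true outcome is observed. This is an illustrative stand-in and may not match any of the four PB variants the paper evaluates. The function names, the simulated data, and the choice of missingness completely at random are all assumptions made for the example, and standard errors are omitted.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def pb_ols(X_lab, y_lab, f_lab, X_unlab, f_unlab):
    """Prediction-based OLS: fit on predictions, then correct with the labeled subset."""
    naive = ols(X_unlab, f_unlab)          # association with the predicted outcome
    rectifier = ols(X_lab, f_lab - y_lab)  # how far predictions drift from the truth
    return naive - rectifier

# Hypothetical data: one variant g with a true effect of 0.5 on the biomarker y.
rng = np.random.default_rng(0)
n = 20000
g = rng.binomial(2, 0.3, size=n).astype(float)
y = 0.5 * g + rng.normal(size=n)
f = 0.8 * y + 0.3 + rng.normal(scale=0.5, size=n)  # deliberately imperfect prediction
observed = rng.random(n) < 0.2                     # assumption: missing completely at random

X = np.column_stack([np.ones(n), g])
naive = ols(X[~observed], f[~observed])
est = pb_ols(X[observed], y[observed], f[observed], X[~observed], f[~observed])
print("naive slope on predictions:", naive[1])
print("prediction-based slope:", est[1])
```

In this setup the slope fitted on predictions alone is attenuated because the predictor shrinks the outcome, and the rectifier term is what restores the target coefficient. The correction is only valid here because the labeled subset is a random sample, which is exactly the assumption the paper stresses can fail in EHR data.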
What carries the argument
Prediction-based (PB) inference: missing biomarker outcomes are filled in with machine learning predictions, and the subsequent association tests account for the resulting imputation uncertainty, either through a direct adjustment or through weighting.
If this is right
- PB methods outperform complete-case analysis in power for both continuous and binary outcomes when missingness is modeled correctly.
- In All of Us data, PB methods replicate established GWAS signals for laboratory biomarkers with higher efficiency than (weighted) complete-case analysis.
- Method performance varies with imputation quality and the specific observation process generating the missing biomarkers.
- The comparison covers nine methods: four PB variants and five traditional missing-data techniques.
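For readers unfamiliar with the weighted complete-case baseline mentioned above, the following sketch shows inverse-probability weighting in its simplest form. It is an illustration under stated assumptions, not the paper's implementation: the proxy variable `z`, the observation model `p_obs`, and the use of the true observation probabilities as weights are all hypothetical. In practice those probabilities would have to be estimated.

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares: solve (X' W X) b = X' W y."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

# Hypothetical data: variant g affects biomarker y with slope 0.5.
rng = np.random.default_rng(2)
n = 50000
g = rng.binomial(2, 0.3, size=n).astype(float)
y = 0.5 * g + rng.normal(size=n)
z = y + rng.normal(size=n)               # fully observed proxy that drives test ordering
p_obs = 0.2 + 0.6 / (1 + np.exp(-z))     # assumed observation probability, bounded away from 0
r = rng.random(n) < p_obs                # r is True where the biomarker is observed

X = np.column_stack([np.ones(n), g])
cca = wls(X[r], y[r], np.ones(r.sum()))  # unweighted complete-case fit
ipw = wls(X[r], y[r], 1 / p_obs[r])      # weighted complete-case fit
print("complete-case slope:", cca[1])
print("weighted complete-case slope:", ipw[1])
```

The weighting recovers the full-data slope here only because observation depends solely on a measured variable and the weights are correct. Neither condition can be taken for granted when tests are ordered for reasons the data do not capture.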
Where Pith is reading between the lines
- If the independence conditions are routinely met in real EHR settings, PB methods could be used more widely even when perfect missingness models are unavailable.
- Diagnostics for checking conditional independence between genetic variants and missingness would help decide when PB methods are safe to apply.
- The framework could be extended to test other clinically informative missingness patterns, such as those driven by health-system factors not captured in the data.
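One simple form the diagnostic suggested above could take is a regression of the missingness indicator on the genetic variant, adjusting for measured covariates. The sketch below uses a linear probability model with a heteroscedasticity-robust standard error. The function name, the simulated covariate `age`, and both observation models are assumptions for illustration. A check like this can only detect dependence that runs through measured variables, so a small statistic does not establish that the independence condition holds.

```python
import numpy as np

def missingness_z_score(r, g, covars):
    """Robust z-score for the variant g in a linear model of the observation indicator r."""
    X = np.column_stack([np.ones(len(r)), g, covars])
    beta, *_ = np.linalg.lstsq(X, r, rcond=None)
    resid = r - X @ beta
    bread = np.linalg.inv(X.T @ X)
    meat = X.T @ (X * resid[:, None] ** 2)  # HC0 sandwich "meat"
    cov = bread @ meat @ bread
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(3)
n = 20000
g = rng.binomial(2, 0.3, size=n).astype(float)
age = rng.normal(size=n)  # hypothetical measured covariate

# Case 1: observation depends on age only, not on the variant.
r_indep = (rng.random(n) < 1 / (1 + np.exp(-0.8 * age))).astype(float)
# Case 2: observation probability rises with each copy of the variant.
r_dep = (rng.random(n) < 0.3 + 0.1 * g).astype(float)

z_indep = missingness_z_score(r_indep, g, age)
z_dep = missingness_z_score(r_dep, g, age)
print("z-score, variant unrelated to missingness:", z_indep)
print("z-score, variant related to missingness:", z_dep)
```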
Load-bearing premise
Either the missing-data mechanism is correctly specified, or both independence conditions hold: the covariates of interest are conditionally independent of the missingness mechanism, and the imputation error is independent of the missingness mechanism.
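The two conditions can be written compactly as follows. The notation is ours, not the paper's, and the conditioning set Z is an assumption, since the summary does not say what the first condition conditions on.

```latex
% Assumed notation: Y is the biomarker outcome, R = 1 if Y is observed,
% X is the covariate of interest (e.g. a genetic variant),
% Z are the adjustment covariates, \hat{f} is the external ML prediction of Y.
\begin{align*}
\text{(i)}\quad  & R \perp\!\!\!\perp X \mid Z
  && \text{(covariate of interest vs. missingness)} \\
\text{(ii)}\quad & R \perp\!\!\!\perp \bigl(Y - \hat{f}\bigr)
  && \text{(imputation error vs. missingness)}
\end{align*}
```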
What would settle it
A simulation where missingness depends on an unmeasured factor that correlates with the genetic covariate of interest, producing lower power or inflated type I error for PB methods than for complete-case analysis.
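A harness for such an experiment might look like the sketch below. It is not the paper's simulation design: the data-generating model, the effect sizes, and the particular prediction-based estimator (predictions on unobserved samples plus a rectifier from observed ones) are assumptions chosen for brevity. The variant has no effect on the outcome, and observation depends on an unmeasured factor `u` that is correlated with the variant, so each reported number is an empirical type I error rate.

```python
import numpy as np

def ols_slope(g, y):
    """Slope and classical standard error from regressing y on an intercept and g."""
    X = np.column_stack([np.ones(len(g)), g])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], np.sqrt(cov[1, 1])

def one_replicate(rng, n=2000, gamma=1.0, alpha=1.0):
    """Return rejection flags (full data, complete case, prediction-based) at the 5% level."""
    g = rng.binomial(2, 0.3, size=n).astype(float)
    noise_u = rng.normal(size=n)
    u = 0.5 * g + noise_u                          # unmeasured factor, correlated with g
    y = gamma * noise_u + rng.normal(size=n)       # null: y is independent of g
    f = 0.7 * y + rng.normal(scale=0.7, size=n)    # hypothetical external prediction
    r = rng.random(n) < 1 / (1 + np.exp(-alpha * u))  # observation driven by u

    b_full, se_full = ols_slope(g, y)
    b_cca, se_cca = ols_slope(g[r], y[r])
    b_naive, se_naive = ols_slope(g[~r], f[~r])
    b_rect, se_rect = ols_slope(g[r], f[r] - y[r])
    b_pb = b_naive - b_rect
    se_pb = np.sqrt(se_naive**2 + se_rect**2)      # the two fits use disjoint samples

    z = np.array([b_full / se_full, b_cca / se_cca, b_pb / se_pb])
    return np.abs(z) > 1.96

def rejection_rates(n_rep=200, seed=1):
    """Empirical rejection rate of each analysis over n_rep simulated datasets."""
    rng = np.random.default_rng(seed)
    hits = np.array([one_replicate(rng) for _ in range(n_rep)])
    return dict(zip(["full", "cca", "pb"], hits.mean(axis=0)))

rates = rejection_rates()
print(rates)
```

Which analysis fares worse depends on the parameters `gamma` and `alpha` and on how the prediction error relates to the outcome, so the harness is meant for exploring the premise rather than for confirming a particular ranking.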
Original abstract
Electronic health record (EHR)-linked biobank data hold tremendous promise for large-scale discoveries via genome-wide association study (GWAS) on diverse phenotypic traits and biomarkers routinely captured in the EHR. However, heterogeneous missingness in biomarkers compromises the validity and efficiency of statistical analyses. Prediction-based (PB) inference methods meet this challenge by using external machine learning (ML) predictions to impute missing biomarker outcomes, thereby improving statistical power and estimation accuracy in association analyses. Yet, their suitability remains unclear when outcomes are subject to clinically informative observation processes, that is, when laboratory tests are ordered based on both measured and unmeasured patient- and health system-level characteristics. In this paper, we review the statistical underpinnings of popular PB methods and then evaluate nine methods, including four PB methods and five traditional missing-data approaches, under an encompassing set of outcome observation processes for continuous and binary outcomes. PB methods can substantially improve statistical power and estimation efficiency when the missing-data mechanism is correctly specified. Under misspecification, however, these gains require both conditional independence between the covariates of interest and the missingness mechanism and independence between imputation error and the missingness mechanism. Using All of Us (AoU) data, we perform GWAS of six laboratory biomarkers and demonstrate that PB methods can replicate known genetic associations while improving efficiency relative to (weighted) complete-case analysis (CCA). Their performance in replicating existing GWAS results in AoU also depends on imputation quality and the underlying missingness mechanism.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reviews prediction-based (PB) inference methods for handling missing biomarker outcomes in EHR-linked biobank GWAS and evaluates nine methods (four PB, five traditional) via simulations under various outcome observation processes for continuous and binary traits. It claims PB methods substantially improve power and efficiency when the missing-data mechanism is correctly specified; under misspecification, gains require conditional independence between covariates of interest and the missingness mechanism plus independence between imputation error and missingness. Real-data analysis on All of Us replicates known genetic associations for six biomarkers with efficiency gains over (weighted) complete-case analysis, with performance depending on imputation quality and the underlying missingness mechanism.
Significance. If the stated conditions on the missingness mechanism hold, PB methods provide a practical route to leverage external ML predictions for imputing informative missing outcomes, increasing power and efficiency in large-scale genetic studies of EHR traits. The broad simulation framework and AoU replication offer concrete guidance for practitioners, though the unverifiable nature of the independence assumptions in real EHR observation processes limits the strength of the efficiency claims relative to complete-case analysis.
major comments (2)
- [Abstract] Abstract: the claim that PB gains under misspecification 'require both conditional independence between the covariates of interest and the missingness mechanism and independence between imputation error and the missingness mechanism' is load-bearing for superiority over CCA, yet the All of Us GWAS demonstration reports efficiency gains without any verification or sensitivity analysis confirming that these independence conditions held and that the gains did not arise from other features of the data.
- [Simulation study] Simulation study section: the encompassing set of outcome observation processes is described at a high level without explicit data-generating equations, data exclusion rules, or how the two independence conditions are enforced versus violated, preventing verification that reported power gains arise from the claimed mechanism rather than simulation artifacts.
minor comments (1)
- [Review of PB methods] Notation for the four PB methods in the review section could be standardized with explicit equations to aid comparison with the five traditional approaches.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below and will revise the manuscript to improve transparency and address concerns about unverified assumptions.
- Referee: [Abstract] Abstract: the claim that PB gains under misspecification 'require both conditional independence between the covariates of interest and the missingness mechanism and independence between imputation error and the missingness mechanism' is load-bearing for superiority over CCA, yet the All of Us GWAS demonstration reports efficiency gains without any verification or sensitivity analysis confirming that these independence conditions held and that the gains did not arise from other features of the data.
Authors: We agree the independence conditions are central to interpreting efficiency gains under misspecification, as demonstrated in our simulations. Direct verification is not feasible in real EHR data due to unmeasured factors. We will add a sensitivity analysis section exploring robustness to potential violations of these assumptions in the All of Us results. revision: partial
- Referee: [Simulation study] Simulation study section: the encompassing set of outcome observation processes is described at a high level without explicit data-generating equations, data exclusion rules, or how the two independence conditions are enforced versus violated, preventing verification that reported power gains arise from the claimed mechanism rather than simulation artifacts.
Authors: We agree that explicit details will strengthen reproducibility. In the revision, we will include the full data-generating equations for each observation process, specify data exclusion rules, and detail how the conditional independence and imputation error independence conditions are enforced or violated. revision: yes
Circularity Check
No circularity: claims rest on external simulations, data benchmarks, and standard missing-data theory
The paper reviews existing PB methods and evaluates nine approaches (four PB, five traditional) via simulation under varied outcome observation processes and via All of Us GWAS application. Performance gains are conditioned on correct missingness specification or on two explicit independence assumptions under misspecification; these are stated as requirements, not derived from the paper's own fitted parameters or equations. No derivation step equates a reported gain to a quantity defined by the same fitted inputs (e.g., no prediction of IE_1 ratios from parameters fitted to IE_1 data). Comparisons to complete-case analysis are external benchmarks. No self-citation chain is load-bearing for the central claims, and no ansatz or uniqueness result is smuggled in. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Missing-data mechanism can be modeled or approximated using external machine-learning predictions