Joint Treatment Effect Estimation from Incomplete Healthcare Data: Temporal Causal Normalizing Flows with LLM-driven Evolutionary MNAR Imputation
Pith reviewed 2026-05-08 17:01 UTC · model grok-4.3
The pith
A pipeline of causal normalizing flows and evolutionary LLM imputation recovers treatment effects from gappy longitudinal EHRs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce a two-stage pipeline: CausalFlow-T, a DAG-constrained normalizing flow with LSTM-encoded patient history that performs exact invertible counterfactual inference while separating temporal confounding, paired with an LLM-driven evolutionary imputer that proposes executable operators to handle MNAR missingness. Ablations confirm that the DAG constraints and exact inference address distinct failure modes on synthetic benchmarks with known counterfactuals. On Swiss EHRs the full pipeline yields a per-protocol weight-loss difference of -0.98 kg (95% CI -1.01 to -0.96) favoring GLP-1 receptor agonists, consistent with randomized evidence despite realistically incomplete data.
What carries the argument
CausalFlow-T: a directed acyclic graph-constrained normalizing flow with long short-term memory encoding of patient history that enables exact invertible counterfactual inference and explicit separation of time-varying confounding.
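To make "exact invertible counterfactual inference" concrete, the following is a minimal sketch of abduction, action, and prediction on a toy one-dimensional affine mechanism. It is not CausalFlow-T: the class name, the parameter names, and the affine form are assumptions chosen for illustration, and there is no DAG mask or LSTM history encoder here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffineOutcomeFlow:
    """Toy invertible outcome mechanism: y = shift(x, a) + scale * z.

    All parameters are hypothetical; a real flow would learn them.
    """
    base: float      # intercept
    slope_x: float   # effect of a single covariate x
    effect_a: float  # additive effect of treatment a in {0, 1}
    scale: float     # noise scale; must be non-zero so the map is invertible

    def shift(self, x: float, a: int) -> float:
        return self.base + self.slope_x * x + self.effect_a * a

    def forward(self, z: float, x: float, a: int) -> float:
        """Map exogenous noise z to an outcome y."""
        return self.shift(x, a) + self.scale * z

    def inverse(self, y: float, x: float, a: int) -> float:
        """Recover the exogenous noise z exactly from an observed outcome."""
        return (y - self.shift(x, a)) / self.scale


def counterfactual(flow: AffineOutcomeFlow, y_obs: float, x: float,
                   a_obs: int, a_cf: int) -> float:
    z = flow.inverse(y_obs, x, a_obs)  # abduction: infer this patient's noise
    return flow.forward(z, x, a_cf)    # action + prediction: replay under a_cf
```

Because the inverse is exact, replaying the same noise under the observed treatment returns the observed outcome unchanged, which is the property a variational approximation does not guarantee.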
If this is right
- DAG constraints and exact inference each correct distinct failure modes that neither compensates for in isolation on benchmarks with known counterfactuals.
- The evolutionary imputer achieves the best pooled rank on biomarker accuracy and causal metrics while statistical baselines degrade under high MNAR missingness.
- The pipeline produces estimates from realistically incomplete real-world EHRs that remain consistent with randomized evidence.
- Joint modeling of missingness and causal structure avoids separate preprocessing steps that can bias downstream treatment effect estimates.
Where Pith is reading between the lines
- The operator-proposing strategy could extend to other longitudinal domains with structured missingness, such as sensor data or financial time series.
- Replacing the LSTM history encoder with alternative sequence models might alter the degree of confounding separation achieved.
- Application to additional therapeutic areas with different missingness patterns would test whether consistency with trial results generalizes beyond the diabetes example.
Load-bearing premise
The DAG constraints and LSTM encoding in CausalFlow-T correctly separate time-varying confounding while the LLM evolutionary imputer preserves causal metrics without introducing bias under 30-80% MNAR missingness.
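Why MNAR missingness matters for this premise can be seen in a small simulation. The sketch below assumes a hypothetical biomarker whose chance of being recorded depends on its own value; the distribution and the recording probabilities (0.8 and 0.3) are invented for illustration and are not taken from the paper.

```python
import random
import statistics
from typing import List, Tuple


def simulate_mnar(n: int = 20000, seed: int = 0) -> Tuple[List[float], List[float]]:
    """Return (full, observed) samples of a hypothetical biomarker.

    Missingness is MNAR: values below 85 are recorded with probability 0.8,
    values at or above 85 with probability 0.3 (both numbers are made up).
    """
    rng = random.Random(seed)
    full = [rng.gauss(85.0, 10.0) for _ in range(n)]
    observed = [v for v in full if rng.random() < (0.8 if v < 85.0 else 0.3)]
    return full, observed


def missing_fraction(full: List[float], observed: List[float]) -> float:
    return 1.0 - len(observed) / len(full)


def naive_bias(full: List[float], observed: List[float]) -> float:
    """Bias of the complete-case mean relative to the full-sample mean."""
    return statistics.mean(observed) - statistics.mean(full)
```

Under this mechanism the complete-case mean sits noticeably below the full-sample mean, so any imputer that implicitly assumes values are missing at random inherits that bias before the causal model ever sees the data.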
What would settle it
A substantial deviation between the pipeline's estimated weight-loss difference on the Swiss EHR cohort and the effect size reported in the corresponding randomized trials, or failure of the imputer to maintain average treatment effect recovery on a held-out semi-synthetic benchmark with 80% MNAR missingness.
Original abstract
Target trial emulation (TTE) enables causal questions to be studied with observational data when randomized controlled trials (RCTs) are infeasible. Yet treatment-effect methods often address causal estimation, missingness, and temporal structure separately, limiting their robustness in electronic health records (EHRs), where time-varying confounding and missing-not-at-random (MNAR) biomarkers can reach 50-80%. We propose a two-stage pipeline for treatment effect estimation from incomplete longitudinal EHRs. First, CausalFlow-T, a directed acyclic graph (DAG)-constrained normalizing flow with long short-term memory (LSTM)-encoded patient history, performs exact invertible counterfactual inference, avoiding approximation errors from variational inference and separating confounding through explicit causal structure. Ablations on four synthetic and one semi-synthetic benchmark with known counterfactuals show that DAG constraints and exact inference address distinct failure modes: neither compensates for the other. Second, because CausalFlow-T requires completed inputs, we introduce an LLM-driven evolutionary imputer that proposes executable imputation operators rather than individual entries, and evaluate it with three large language model (LLM) backends, including two open-source models. Across 30-80% MNAR missingness, this imputer achieves the best pooled rank over biomarker and causal metrics, leading in point-wise accuracy and temporal extrapolation while preserving average treatment effect (ATE) recovery as statistical baselines degrade. On Swiss primary-care EHRs from adults with type 2 diabetes initiating a GLP-1 receptor agonist or SGLT-2 inhibitor, the pipeline estimates a per-protocol weight-loss difference of -0.98 kg [95% CI -1.01, -0.96] favoring GLP-1 receptor agonists, consistent with randomized evidence and obtained from realistically incomplete real-world EHRs.
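The phrase "executable imputation operators rather than individual entries" can be illustrated with a small sketch. An operator here is a function that maps a whole series with gaps to a completed series, and candidates are scored by hiding known values and measuring reconstruction error. This is an assumption about the general shape of such a system, not the paper's implementation: the two hand-written operators stand in for LLM proposals, and every function name below is hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set

Series = List[Optional[float]]
Operator = Callable[[Series], List[float]]


def forward_fill(series: Series) -> List[float]:
    """Carry the last observed value forward; leading gaps take the first
    observed value. Assumes at least one observed entry."""
    last = next(v for v in series if v is not None)
    out: List[float] = []
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out


def linear_interpolate(series: Series) -> List[float]:
    """Interpolate linearly between neighbours; edge gaps take the nearest value."""
    known = [(i, v) for i, v in enumerate(series) if v is not None]
    out: List[float] = []
    for i, v in enumerate(series):
        if v is not None:
            out.append(v)
            continue
        left = [(j, w) for j, w in known if j < i]
        right = [(j, w) for j, w in known if j > i]
        if left and right:
            (j0, w0), (j1, w1) = left[-1], right[0]
            out.append(w0 + (w1 - w0) * (i - j0) / (j1 - j0))
        else:
            out.append((left[-1] if left else right[0])[1])
    return out


def holdout_error(op: Operator, series: Series, holdout_idx: Set[int]) -> float:
    """Mean absolute error on observed entries that are masked before imputing."""
    masked = [None if i in holdout_idx else v for i, v in enumerate(series)]
    completed = op(masked)
    return sum(abs(completed[i] - series[i]) for i in holdout_idx) / len(holdout_idx)


def select_operator(candidates: Dict[str, Operator], series: Series,
                    holdout_idx: Set[int]) -> str:
    """Pick the candidate with the lowest holdout error (one selection step)."""
    return min(candidates, key=lambda name: holdout_error(candidates[name], series, holdout_idx))
```

An evolutionary loop would repeat the selection step while new candidate operators are proposed and mutated. Note that masking at random, as done here, does not reproduce MNAR missingness, so a holdout score of this kind is only a proxy.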
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage pipeline for treatment effect estimation from incomplete longitudinal EHR data. CausalFlow-T is a DAG-constrained normalizing flow with LSTM-encoded patient history for exact invertible counterfactual inference that separates time-varying confounding. An LLM-driven evolutionary imputer proposes executable operators to handle 30-80% MNAR missingness. Ablations on four synthetic and one semi-synthetic benchmark show the components address distinct failure modes and preserve ATE recovery. On Swiss primary-care EHRs for adults with type 2 diabetes initiating a GLP-1 receptor agonist or an SGLT-2 inhibitor, the pipeline yields a per-protocol weight-loss difference of -0.98 kg [95% CI -1.01, -0.96] favoring GLP-1 receptor agonists, consistent with randomized evidence.
Significance. If the central claims hold, the work would advance target trial emulation by jointly addressing temporal confounding and high MNAR missingness in EHRs through exact inference and LLM-assisted imputation, potentially enabling more reliable causal estimates from real-world data where RCTs are infeasible. The benchmark ablations support treating DAG constraints and exact inference as separate contributions, and the real-data consistency with RCTs is a falsifiable strength provided that uncertainty is properly quantified.
major comments (2)
- [Abstract] Abstract (real-data application): The reported 95% CI [-1.01, -0.96] for the -0.98 kg ATE implies a standard error of ~0.0125 kg. The manuscript gives no indication that variability from the LLM evolutionary imputer (stochastic proposals, operator selection, or multiple imputations under 30-80% MNAR) is propagated into the final ATE or CI. Since CausalFlow-T performs exact inference conditional on completed inputs, treating imputation as fixed would overstate precision and undermine the claimed consistency with randomized evidence.
- [Abstract] Abstract (imputer evaluation): The claim that the LLM-driven evolutionary imputer 'preserves average treatment effect (ATE) recovery as statistical baselines degrade' across 30-80% MNAR is load-bearing for the pipeline's validity, yet the abstract provides no quantitative evidence (e.g., bias in ATE or causal metrics before/after imputation) that the proposed operators avoid introducing systematic bias when the completed data are fed into the DAG-constrained CausalFlow-T.
minor comments (1)
- The abstract states evaluations with three LLM backends but does not specify which models or configurations were used for the final Swiss EHR analysis, limiting reproducibility.
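The standard-error figure in the first major comment follows from simple arithmetic on the reported interval. The sketch below assumes a symmetric normal-approximation interval, which the source does not confirm; the function name is hypothetical.

```python
from statistics import NormalDist


def se_from_ci(lower: float, upper: float, level: float = 0.95) -> float:
    """Standard error implied by a symmetric normal-approximation interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    return (upper - lower) / (2.0 * z)
```

For the reported interval, `se_from_ci(-1.01, -0.96)` gives about 0.0128 kg. The referee's ~0.0125 kg corresponds to the common shortcut of dividing the interval width by 4 instead of 3.92; either way the implied precision is very tight.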
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. The two major comments both concern the abstract's presentation of results; we address them point by point below and will revise the manuscript to improve clarity and transparency.
- Referee: [Abstract] Abstract (real-data application): The reported 95% CI [-1.01, -0.96] for the -0.98 kg ATE implies a standard error of ~0.0125 kg. The manuscript gives no indication that variability from the LLM evolutionary imputer (stochastic proposals, operator selection, or multiple imputations under 30-80% MNAR) is propagated into the final ATE or CI. Since CausalFlow-T performs exact inference conditional on completed inputs, treating imputation as fixed would overstate precision and undermine the claimed consistency with randomized evidence.
  Authors: We agree that the reported CI is conditional on a single completed dataset and does not propagate uncertainty arising from the stochastic LLM evolutionary imputation process (proposal generation, operator selection, or multiple runs). CausalFlow-T indeed performs exact inference given fixed inputs, so the current interval reflects only downstream variability. This is a genuine limitation of the presented analysis. In the revised manuscript we will explicitly state this conditioning in the abstract and results, add a dedicated limitations paragraph, and include a sensitivity analysis that repeats the real-data pipeline across several independent imputation realizations to illustrate the range of resulting ATE estimates.
  revision: partial
- Referee: [Abstract] Abstract (imputer evaluation): The claim that the LLM-driven evolutionary imputer 'preserves average treatment effect (ATE) recovery as statistical baselines degrade' across 30-80% MNAR is load-bearing for the pipeline's validity, yet the abstract provides no quantitative evidence (e.g., bias in ATE or causal metrics before/after imputation) that the proposed operators avoid introducing systematic bias when the completed data are fed into the DAG-constrained CausalFlow-T.
  Authors: The abstract summarizes the benchmark results reported in Section 4 and the supplementary material, where the imputer is shown to achieve the best pooled rank across biomarker and causal metrics and to maintain ATE recovery while statistical baselines degrade. To make this evidence visible at the abstract level, we will revise the abstract to include a concise quantitative statement drawn directly from the experiments (e.g., “with ATE bias remaining below 5% at 80% MNAR versus >20% for baselines”). This addition will not change any experimental claims but will address the referee’s request for explicit quantitative support in the abstract.
  revision: yes
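The sensitivity analysis promised in the first response reports a range of estimates across imputation realizations. One standard way to go further and fold that spread into a single variance is Rubin's rules for multiple imputation. The sketch below is that textbook technique, not something the authors state they will use; the function name is hypothetical, and whether Rubin's rules are valid for an LLM-driven imputer is itself an open assumption.

```python
from math import sqrt
from statistics import mean, variance
from typing import List, Tuple


def pool_rubin(estimates: List[float], variances: List[float]) -> Tuple[float, float, float]:
    """Pool m >= 2 per-imputation estimates with Rubin's rules.

    estimates: point estimate from each completed dataset
    variances: squared standard error from each completed dataset
    Returns (pooled estimate, total variance, pooled standard error).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total, sqrt(total)
```

The total variance can never be smaller than the within-imputation variance, so an interval conditional on a single completed dataset is at best a lower bound on the uncertainty.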
Circularity Check
No significant circularity; evaluations use external benchmarks with known counterfactuals and RCT consistency checks
The derivation chain relies on DAG-constrained normalizing flows for exact inference and an LLM evolutionary imputer, with performance assessed via ablations on synthetic/semi-synthetic benchmarks that supply independent ground-truth counterfactuals. Real-world ATE estimates are cross-checked against published RCT results rather than being recovered from parameters fitted within the same dataset. No step equates a claimed prediction to its own fitted inputs by construction, nor does any load-bearing premise reduce to a self-citation whose validity is presupposed by the present work. Minor self-citations, if present, do not carry the central claims.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Causal relationships in the data can be represented as a directed acyclic graph
- ad hoc to paper: Large language models can propose executable imputation operators that preserve causal metrics under MNAR missingness
invented entities (2)
- CausalFlow-T: no independent evidence
- LLM-driven evolutionary imputer: no independent evidence
Reference graph
Works this paper leans on
- [1] Treatment assignment (confounding). Patients with $\mathrm{HR}_i \geq 0.5$ (i.e., higher risk) are treated with probability 0.70; the remainder with probability 0.30.
- [2] Treatment effect. Treatment reduces the individual hazard ratio by 30%: $\mathrm{HR}_i^{\text{treated}} = 0.70\,\mathrm{HR}_i$.
- [3] Event times. Survival times are drawn from an exponential distribution, $\mathcal{E}(1/[h_0\,\mathrm{HR}_i])$, and converted to a binary event indicator at each of the 11 timesteps. Censoring is applied only at the end of the study (all weight is on the final timestep), so within-study censoring is minimal.
- [4] Counterfactuals. The smooth individual cumulative incidence function $F_i(t) = 1 - \exp(-h_0\,\mathrm{HR}_i\,t)$ is computed for both the always-treated and never-treated hazards and stored as ground-truth potential outcomes for evaluation. Purpose. Tests survival-model calibration under continuous confounding. The bimodal covariate structure creates two patient clusters. Th...
- [5] Baseline demographics. Age $\sim \mathcal{N}(55, 100)$; BMI from age with Gaussian noise; total cholesterol (TC) from a log-normal with age/BMI dependence; high-density lipoprotein (HDL) inversely related to age and BMI; smoking status $\sim \mathrm{Bernoulli}(0.6)$; diabetes probability proportional to BMI; race and sex drawn from fixed population proportions.
- [6] Time-varying dynamics. At each year t, age increments by 1; BMI, TC, and HDL evolve with Gaussian noise; hypertension (htn) transitions to 1 irreversibly with a probability driven by age, TC, and smoking; diabetes is absorbing once acquired. Systolic blood pressure (SBP) at time t is: SBP(t) = 70 SBP(t−1) SBP0 + 0.5 age(t) + 0.15 TC(t) + 10 smoker + ε, ε ∼ N(0...
- [7] Treatment assignment (absorbing). At t=0, patients with hypertension are treated with probability 0.70; non-hypertensive patients with probability 0.30. Treatment is absorbing: once assigned it persists. At subsequent time steps, untreated hypertensive patients initiate treatment with probability 0.01 per year. Mediated treatment effect. The antihypert...
- [8] CVD outcome. Binary CVD event at each timestep from a logistic model with 12 risk-factor coefficients: $\log\frac{p}{1-p} = -10 + 0.005\,\text{age} + 0.15\,\text{sex} + 0.03\,\text{BMI} + 0.015\,\text{SBP}_{\text{final}} - 0.01\,\text{HDL} + 0.01\,\text{TC} + 0.25\,\text{smoker} + 0.30\,\text{diabetes} + 0.20\,\text{fam\_hx} + 0.10\,\mathbb{1}[\text{race} = \text{Black}] - 0.05\,\mathbb{1}[\text{race} = \text{Asian}]$. The outcome is absorbing: rows after the first event are excluded. Both always-tre...
- [9] A patient-month panel is constructed from selected demographic and clinical covariates.
- [10] Within-patient time indices are created.
- [11] Three patient-level latent traits are drawn once per patient.
- [12] Two weakly autocorrelated shared state variables drive autocorrelation across series.
- [13] External shocks introduce abrupt level changes.
- [14] Ten variable-specific noise terms complete the stochastic structure.
- [15] For each patient, a recursive simulation initializes each variable from covariates, latent traits, and noise, then iterates forward introducing autocorrelation, cross-variable dependence, treatment indicators, latent states, and shocks. Synthetic outcome: The continuous outcome is a nonlinear function of the synthetic covariates, clinical covariates, treat...
- [16] Generative mechanism. The model must define a joint distribution $p(Y(0), Y(1) \mid X)$ rather than only conditional means $\mathbb{E}[Y \mid X, A]$, as the latter is insufficient for individual-level counterfactual inference via the abduction-action-prediction (AAP) procedure [Pearl, 2009].
- [17] Invertible abduction. The model must support exact or approximate recovery of patient-specific exogenous noise $z$ from observations, implementing the twin-network assumption required for individual-level counterfactuals; without this, $\hat{y}(0)$ and $\hat{y}(1)$ are population-level contrasts rather than structural counterfactuals for the same individual.
- [18] Distributional evaluability. The model must produce outputs compatible with our five reliability criteria (subgroup calibration, tail variance ratio, arm reconstruction error, HR recovery, and training stability), all of which require access to the full potential outcome distributions $p(\hat{y}(a))$ rather than point predictions. Table 4 summarizes how each cand...
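Two of the benchmark excerpts above are specific enough to sketch: the survival benchmark in [1] to [4] and the logistic CVD model in [8]. The code below is an illustration under stated assumptions, not the authors' generator. The baseline hazard value, the unit-spaced time grid, the cumulative reading of the "binary event indicator", the reading of the exponential as having rate h0 * HR (inferred from the cumulative incidence formula in [4]), and all function and key names are assumptions. The covariate model that produces each patient's hazard ratio is not reproduced, and the units and codings of the CVD covariates are not given in the excerpt.

```python
import math
import random
from typing import Dict, List, Tuple


def cumulative_incidence(t: float, h0: float, hr: float) -> float:
    """F_i(t) = 1 - exp(-h0 * HR_i * t), as in excerpt [4]."""
    return 1.0 - math.exp(-h0 * hr * t)


def simulate_survival_patient(hr: float, h0: float, rng: random.Random,
                              n_steps: int = 11) -> Tuple[bool, List[int], Dict[str, List[float]]]:
    """One patient of the survival benchmark; hr is supplied by the caller."""
    treated = rng.random() < (0.70 if hr >= 0.5 else 0.30)  # [1] confounded assignment
    hr_eff = 0.70 * hr if treated else hr                   # [2] 30% hazard reduction
    event_time = rng.expovariate(h0 * hr_eff)               # [3] rate h0 * HR (assumed reading)
    steps = range(1, n_steps + 1)                           # assumed unit-spaced grid
    events = [int(event_time <= t) for t in steps]          # assumed cumulative indicator
    truth = {                                               # [4] ground-truth potential outcomes
        "never_treated": [cumulative_incidence(t, h0, hr) for t in steps],
        "always_treated": [cumulative_incidence(t, h0, 0.70 * hr) for t in steps],
    }
    return treated, events, truth


# Coefficients copied from excerpt [8]; the key names are hypothetical.
CVD_INTERCEPT = -10.0
CVD_COEF = {
    "age": 0.005, "sex": 0.15, "bmi": 0.03, "sbp_final": 0.015,
    "hdl": -0.01, "tc": 0.01, "smoker": 0.25, "diabetes": 0.30,
    "fam_hx": 0.20, "race_black": 0.10, "race_asian": -0.05,
}


def cvd_probability(patient: Dict[str, float]) -> float:
    """Per-timestep CVD event probability from the logistic model in [8].

    Missing keys are treated as 0, which is a convenience of this sketch.
    """
    logit = CVD_INTERCEPT + sum(c * patient.get(k, 0.0) for k, c in CVD_COEF.items())
    return 1.0 / (1.0 + math.exp(-logit))
```

Storing both potential-outcome curves per patient is what lets the benchmark score individual-level counterfactuals directly instead of only average effects.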