Joint Treatment Effect Estimation from Incomplete Healthcare Data: Temporal Causal Normalizing Flows with LLM-driven Evolutionary MNAR Imputation

David Catalan Cerezo; Franziska Ulrich; Jakob Martin Burgstaller; Nicola Serra; Oliver Senn; Olivia Jullian Parra; Patrick Owen; Sara Zoccheddu; Tom Forzy; William Sutcliffe

arxiv: 2605.05125 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-08 17:01 UTC · model grok-4.3

keywords treatment effect estimation · missing not at random · normalizing flows · causal inference · electronic health records · target trial emulation · imputation · longitudinal data

The pith

A pipeline of causal normalizing flows and evolutionary LLM imputation recovers treatment effects from gappy longitudinal EHRs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a two-stage method for estimating causal treatment effects in healthcare when patient records contain substantial missing biomarker values that are not missing at random. The first stage uses a normalizing flow, constrained by a directed acyclic graph and conditioned on an LSTM encoding of patient history, to perform exact, invertible counterfactual inference that separates out time-varying confounding. The second stage uses large language models to evolve and select executable imputation operators rather than filling individual values, and it preserves downstream causal metrics across 30 to 80 percent missingness. The joint approach matters for two reasons: randomized trials are infeasible for many treatment questions, and handling missing data and confounding in separate steps tends to bias results. On real Swiss primary-care records of adults with type 2 diabetes, the pipeline estimates a weight-loss advantage for GLP-1 receptor agonists over SGLT-2 inhibitors that is consistent with randomized evidence.

Core claim

The authors introduce a two-stage pipeline: CausalFlow-T, a DAG-constrained normalizing flow with LSTM-encoded patient history that performs exact invertible counterfactual inference while separating temporal confounding, paired with an LLM-driven evolutionary imputer that proposes executable operators to handle MNAR missingness. Ablations confirm that the DAG constraints and exact inference address distinct failure modes on synthetic benchmarks with known counterfactuals. On Swiss EHRs the full pipeline yields a per-protocol weight-loss difference of -0.98 kg (95% CI -1.01 to -0.96) favoring GLP-1 receptor agonists, consistent with randomized evidence despite realistically incomplete data.
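The propose, score, and keep-the-best loop behind the imputer can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: a fixed pool of three hand-written operators (`impute_mean`, `impute_locf`, `impute_linear`) stands in for LLM proposals, and the self-supervised proxy is assumed to be the mean squared error on observed entries that are hidden on purpose. All function names and the toy series are hypothetical.

```python
import random
from statistics import mean

def impute_mean(series):
    """Fill every gap with the mean of the observed values."""
    fill = mean(v for v in series if v is not None)
    return [fill if v is None else v for v in series]

def impute_locf(series):
    """Last observation carried forward; leading gaps take the first observed value."""
    last = next(v for v in series if v is not None)
    out = []
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out

def impute_linear(series):
    """Linear interpolation between observed neighbours; edges copy the nearest value."""
    idx = [i for i, v in enumerate(series) if v is not None]
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((j for j in idx if j < i), default=None)
        right = min((j for j in idx if j > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            w = (i - left) / (right - left)
            out[i] = (1 - w) * series[left] + w * series[right]
    return out

def proxy_score(imputer, series, hide_frac=0.3, seed=0):
    """Hide some observed entries, impute, and return the MSE on the hidden ones."""
    rng = random.Random(seed)  # same seed for every candidate, so the hidden set is shared
    observed = [i for i, v in enumerate(series) if v is not None]
    hidden = rng.sample(observed, max(1, int(hide_frac * len(observed))))
    probe = list(series)
    for i in hidden:
        probe[i] = None
    filled = imputer(probe)
    return mean((filled[i] - series[i]) ** 2 for i in hidden)

def evolve(series, proposals, rounds):
    """Keep the best-scoring operator over a number of proposal rounds."""
    best, best_score = None, float("inf")
    for k in range(rounds):
        candidate = proposals[k % len(proposals)]  # stand-in for an LLM proposal
        score = proxy_score(candidate, series)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score

# Toy biomarker trajectory with a linear trend and every fourth value missing.
series = [None if t % 4 == 0 else float(t) for t in range(40)]
best, best_score = evolve(series, [impute_mean, impute_locf, impute_linear], rounds=3)
completed = best(series)
```

Because the search selects a whole operator rather than individual values, the winning function can be re-applied unchanged to new patients, which is the property the paper emphasizes.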

What carries the argument

CausalFlow-T: a directed acyclic graph-constrained normalizing flow with long short-term memory encoding of patient history that enables exact invertible counterfactual inference and explicit separation of time-varying confounding.
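To make "exact invertible counterfactual inference" concrete, the sketch below runs the three steps of abduction, action, and prediction on a one-dimensional affine map. It is an illustration under strong assumptions: the `shift` and `log_scale` functions are hypothetical stand-ins for a learned conditional flow, `h` is a scalar summary of patient history, and nothing here reproduces the CausalFlow-T architecture.

```python
import math

def shift(a, h):
    """Hypothetical conditional mean: treatment a lowers the outcome, history h raises it."""
    return -1.0 * a + 0.5 * h

def log_scale(a, h):
    """Hypothetical conditional log-scale of the outcome noise."""
    return 0.1 * a

def forward(z, a, h):
    """Map exogenous noise z to an outcome y given treatment and history."""
    return shift(a, h) + math.exp(log_scale(a, h)) * z

def inverse(y, a, h):
    """Exact inverse of forward: recover the noise z that produced y."""
    return (y - shift(a, h)) * math.exp(-log_scale(a, h))

def counterfactual(y_obs, a_obs, a_new, h):
    z = inverse(y_obs, a_obs, h)   # abduction: recover the patient-specific noise
    return forward(z, a_new, h)    # action and prediction: replay it under a_new

y_obs, a_obs, h = 1.5, 0, 2.0
y_cf = counterfactual(y_obs, a_obs, 1, h)
individual_effect = y_cf - y_obs
```

Because the map is invertible in closed form, the recovered noise is exact rather than a variational approximation, which is the distinction the paper draws against variational inference.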

If this is right

DAG constraints and exact inference each correct distinct failure modes that neither compensates for in isolation on benchmarks with known counterfactuals.
The evolutionary imputer achieves the best pooled rank on biomarker accuracy and causal metrics while statistical baselines degrade under high MNAR missingness.
The pipeline produces estimates from realistically incomplete real-world EHRs that remain consistent with randomized evidence.
Joint modeling of missingness and causal structure avoids separate preprocessing steps that can bias downstream treatment effect estimates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The operator-proposing strategy could extend to other longitudinal domains with structured missingness, such as sensor data or financial time series.
Replacing the LSTM history encoder with alternative sequence models might alter the degree of confounding separation achieved.
Application to additional therapeutic areas with different missingness patterns would test whether consistency with trial results generalizes beyond the diabetes example.

Load-bearing premise

The DAG constraints and LSTM encoding in CausalFlow-T correctly separate time-varying confounding while the LLM evolutionary imputer preserves causal metrics without introducing bias under 30-80% MNAR missingness.

What would settle it

A substantial deviation between the pipeline's estimated weight-loss difference on the Swiss EHR cohort and the effect size reported in the corresponding randomized trials, or failure of the imputer to maintain average treatment effect recovery on a held-out semi-synthetic benchmark with 80% MNAR missingness.
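A held-out check of this kind needs a way to inject MNAR missingness at a chosen rate. The sketch below shows one common construction, self-masking, in which the probability that a value is missing depends on the value itself. It is an assumption for illustration only: the logistic form, the `slope` parameter, and the toy biomarker are hypothetical, and the paper's own missingness mechanism is not described in this summary.

```python
import math
import random
from statistics import mean, pstdev

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mnar_mask(values, target_rate, slope=2.0, seed=0):
    """Return a copy of values with None where an entry is masked.

    The missingness probability rises with the standardized value itself
    (self-masking MNAR). The intercept is found by bisection so that the
    expected missing rate matches target_rate.
    """
    rng = random.Random(seed)
    mu, sd = mean(values), pstdev(values)
    z = [(v - mu) / sd for v in values]
    lo, hi = -30.0, 30.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        rate = sum(sigmoid(mid + slope * zi) for zi in z) / len(z)
        if rate < target_rate:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2.0
    return [
        None if rng.random() < sigmoid(intercept + slope * zi) else v
        for v, zi in zip(values, z)
    ]

# Toy biomarker values; the mean and spread are arbitrary.
rng = random.Random(1)
values = [rng.gauss(7.0, 1.0) for _ in range(5000)]
masked = mnar_mask(values, target_rate=0.80)
observed = [v for v in masked if v is not None]
missing_rate = 1.0 - len(observed) / len(values)
```

Under this mechanism the observed values are a biased sample (high readings are preferentially hidden), so a naive mean of the observed entries underestimates the true mean. That bias is what an imputer has to undo before the causal stage.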

Figures

Figures reproduced from arXiv: 2605.05125 by David Catalan Cerezo, Franziska Ulrich, Jakob Martin Burgstaller, Nicola Serra, Oliver Senn, Olivia Jullian Parra, Patrick Owen, Sara Zoccheddu, Tom Forzy, William Sutcliffe.

**Figure 1.** Stage 1 (LLM-driven Evolutionary Imputation): an LLM iteratively proposes candidate imputers $g^{(k)}$, scores them via a self-supervised proxy $s(g^{(k)})$, and updates the running best $g^\star_k$; after $K$ rounds, $g^\star$ produces $\hat{D}$. Stage 2 (CausalFlow-T): a temporal encoder conditions a DAG-constrained Causal MAF on patient history; counterfactual outcomes $\hat{y}(a')$ are obtained via AAP.

**Figure 2.** Weight change over 1 year for adults with type 2 diabetes initiating GLP-1 receptor agonists

**Figure 3.** Directed acyclic graph of the generation of the synthetic variables and the outcome.

**Figure 4.** Distribution of the synthetic variables and outcome.

**Figure 5.** Expert-specified directed acyclic graph for the active-comparator, new-user cohort study

**Figure 6.** CFM imputer. The Transformer encoder produces context $h_t$ from the always-observed $x^{ns}_{1:T}$. The MLP $u_\theta$ is conditioned on $h_t$ and DAG evidence $(T_t, Y_t)$, and is trained with masked MSE on $v^* = x_1 - x_0$. At inference, missing values are recovered by ODE integration from Gaussian noise.

**Figure 7.** Evolutionary search progress for the LLM-driven imputer across 30%, 50%, and 80%

Original abstract

Target trial emulation (TTE) enables causal questions to be studied with observational data when randomized controlled trials (RCTs) are infeasible. Yet treatment-effect methods often address causal estimation, missingness, and temporal structure separately, limiting their robustness in electronic health records (EHRs), where time-varying confounding and missing-not-at-random (MNAR) biomarkers can reach 50%–80%. We propose a two-stage pipeline for treatment effect estimation from incomplete longitudinal EHRs. First, CausalFlow-T, a directed acyclic graph (DAG)-constrained normalizing flow with long short-term memory (LSTM)-encoded patient history, performs exact invertible counterfactual inference, avoiding approximation errors from variational inference and separating confounding through explicit causal structure. Ablations on four synthetic and one semi-synthetic benchmark with known counterfactuals show that DAG constraints and exact inference address distinct failure modes: neither compensates for the other. Second, because CausalFlow-T requires completed inputs, we introduce an LLM-driven evolutionary imputer that proposes executable imputation operators rather than individual entries, and evaluate it with three large language model (LLM) backends, including two open-source models. Across 30%–80% MNAR missingness, this imputer achieves the best pooled rank over biomarker and causal metrics, leading in point-wise accuracy and temporal extrapolation while preserving average treatment effect (ATE) recovery as statistical baselines degrade. On Swiss primary-care EHRs from adults with type 2 diabetes initiating a GLP-1 receptor agonist or SGLT-2 inhibitor, the pipeline estimates a per-protocol weight-loss difference of -0.98 kg [95% CI -1.01, -0.96] favoring GLP-1 receptor agonists, consistent with randomized evidence and obtained from realistically incomplete real-world EHRs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper's core idea is a pipeline of DAG-constrained temporal normalizing flows for exact causal inference paired with an LLM evolutionary imputer for MNAR data, and it shows benchmark gains plus RCT-consistent results on diabetes EHRs, but the reported real-data precision looks too tight.

This paper's main point is a two-stage method. An LLM-driven evolutionary imputer, which evolves operators instead of filling single values, completes the data. CausalFlow-T, a normalizing flow with explicit DAG constraints and LSTM history encoding, then computes exact invertible counterfactuals from the completed inputs. The ablations on four synthetic benchmarks and one semi-synthetic benchmark show that the DAG structure and the exact flow each fix failure modes the other does not cover, and the imputer holds up better than baselines on both point accuracy and ATE recovery under 30-80% MNAR missingness. On the Swiss primary-care records for type 2 diabetes patients starting GLP-1 receptor agonists or SGLT-2 inhibitors, the pipeline gives a weight-loss difference of -0.98 kg that lines up with randomized evidence. That match is the strongest practical signal here.

The narrow CI around that estimate is the clearest soft spot. A width of 0.05 kg from data with heavy missingness and LLM imputation implies very low uncertainty, yet the abstract gives no sign that variability across LLM runs, operator proposals, or multiple imputations is propagated into the final interval. If that step is missing, the precision and the RCT consistency rest on an untested assumption that imputation error is negligible.

The work is aimed at people building causal methods for longitudinal observational data with realistic missingness, especially in healthcare. A reader who already works with target trial emulation or normalizing flows for counterfactuals will get the most from the benchmarks and the real-data check. It deserves a serious referee because the problem matters and the technical combination is new enough to warrant detailed review, even if the uncertainty handling needs more evidence.

Referee Report

2 major / 1 minor

Summary. The paper proposes a two-stage pipeline for treatment effect estimation from incomplete longitudinal EHR data. CausalFlow-T is a DAG-constrained normalizing flow with LSTM-encoded patient history for exact invertible counterfactual inference that separates time-varying confounding. An LLM-driven evolutionary imputer proposes executable operators to handle 30-80% MNAR missingness. Ablations on four synthetic benchmarks and one semi-synthetic benchmark show that the components address distinct failure modes and that ATE recovery is preserved. On Swiss primary-care EHRs for type 2 diabetes patients initiating GLP-1 receptor agonists or SGLT-2 inhibitors, the pipeline yields a per-protocol weight-loss difference of -0.98 kg [95% CI -1.01, -0.96] favoring GLP-1 receptor agonists, consistent with randomized evidence.

Significance. If the central claims hold, the work would advance target trial emulation by jointly addressing temporal confounding and high MNAR missingness in EHRs through exact inference and LLM-assisted imputation, potentially enabling more reliable causal estimates from real-world data where RCTs are infeasible. The benchmark results support the claim that DAG constraints and exact inference make separate contributions, and the real-data consistency with RCTs is a falsifiable strength if uncertainty is properly quantified.

major comments (2)

[Abstract] (real-data application): The reported 95% CI [-1.01, -0.96] for the -0.98 kg ATE implies a standard error of ~0.0125 kg. The manuscript gives no indication that variability from the LLM evolutionary imputer (stochastic proposals, operator selection, or multiple imputations under 30-80% MNAR) is propagated into the final ATE or CI. Since CausalFlow-T performs exact inference conditional on completed inputs, treating imputation as fixed would overstate precision and undermine the claimed consistency with randomized evidence.
[Abstract] (imputer evaluation): The claim that the LLM-driven evolutionary imputer 'preserves average treatment effect (ATE) recovery as statistical baselines degrade' across 30-80% MNAR is load-bearing for the pipeline's validity, yet the abstract provides no quantitative evidence (e.g., bias in ATE or causal metrics before/after imputation) that the proposed operators avoid introducing systematic bias when the completed data are fed into the DAG-constrained CausalFlow-T.
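The standard-error figure in the first major comment can be checked with a short calculation. The sketch assumes the interval is a symmetric normal-approximation CI, which the abstract does not state; the helper name `se_from_ci` is illustrative.

```python
from statistics import NormalDist

def se_from_ci(lower, upper, level=0.95):
    """Back out a standard error from a normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    return (upper - lower) / (2.0 * z)

ci_lower, ci_upper = -1.01, -0.96  # interval reported in the abstract
width = ci_upper - ci_lower
se = se_from_ci(ci_lower, ci_upper)
print(f"width = {width:.3f} kg, implied SE = {se:.4f} kg")
```

With z of about 1.96 the implied standard error is roughly 0.0128 kg; dividing the half-width by 2 instead gives the referee's rounder 0.0125 kg. Either way the point stands: the interval reflects a very small sampling error for an estimate built on heavily imputed data.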

minor comments (1)

The abstract states evaluations with three LLM backends but does not specify which models or configurations were used for the final Swiss EHR analysis, limiting reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The two major comments both concern the abstract's presentation of results; we address them point by point below and will revise the manuscript to improve clarity and transparency.

Referee: [Abstract] (real-data application): The reported 95% CI [-1.01, -0.96] for the -0.98 kg ATE implies a standard error of ~0.0125 kg. The manuscript gives no indication that variability from the LLM evolutionary imputer (stochastic proposals, operator selection, or multiple imputations under 30-80% MNAR) is propagated into the final ATE or CI. Since CausalFlow-T performs exact inference conditional on completed inputs, treating imputation as fixed would overstate precision and undermine the claimed consistency with randomized evidence.

Authors: We agree that the reported CI is conditional on a single completed dataset and does not propagate uncertainty arising from the stochastic LLM evolutionary imputation process (proposal generation, operator selection, or multiple runs). CausalFlow-T indeed performs exact inference given fixed inputs, so the current interval reflects only downstream variability. This is a genuine limitation of the presented analysis. In the revised manuscript we will explicitly state this conditioning in the abstract and results, add a dedicated limitations paragraph, and include a sensitivity analysis that repeats the real-data pipeline across several independent imputation realizations to illustrate the range of resulting ATE estimates.
revision: partial
Referee: [Abstract] (imputer evaluation): The claim that the LLM-driven evolutionary imputer 'preserves average treatment effect (ATE) recovery as statistical baselines degrade' across 30-80% MNAR is load-bearing for the pipeline's validity, yet the abstract provides no quantitative evidence (e.g., bias in ATE or causal metrics before/after imputation) that the proposed operators avoid introducing systematic bias when the completed data are fed into the DAG-constrained CausalFlow-T.

Authors: The abstract summarizes the benchmark results reported in Section 4 and the supplementary material, where the imputer is shown to achieve the best pooled rank across biomarker and causal metrics and to maintain ATE recovery while statistical baselines degrade. To make this evidence visible at the abstract level, we will revise the abstract to include a concise quantitative statement drawn directly from the experiments (e.g., “with ATE bias remaining below 5% at 80% MNAR versus >20% for baselines”). This addition will not change any experimental claims but will address the referee’s request for explicit quantitative support in the abstract.
revision: yes
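The first response proposes repeating the pipeline across several imputation realizations. One standard way to fold that spread into a single interval is Rubin's rules for multiple imputation, sketched below. This is a swapped-in technique for illustration, not something the authors say they will use. The five ATE values are made up, the within-run variance reuses the roughly 0.0128 kg standard error implied by the reported CI, and a normal quantile replaces the usual t reference distribution for brevity.

```python
from statistics import NormalDist, mean, variance

def pool_rubin(estimates, within_variances, level=0.95):
    """Combine per-imputation estimates with Rubin's rules (normal approximation)."""
    m = len(estimates)
    if m < 2:
        raise ValueError("need at least two imputation runs")
    q_bar = mean(estimates)             # pooled point estimate
    u_bar = mean(within_variances)      # average within-imputation variance
    b = variance(estimates)             # between-imputation variance (m - 1 denominator)
    total = u_bar + (1.0 + 1.0 / m) * b
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = z * total ** 0.5
    return q_bar, total, (q_bar - half, q_bar + half)

# Hypothetical ATE estimates (kg) from five independent imputation runs.
ate_runs = [-0.98, -0.91, -1.05, -0.95, -1.02]
within = [0.0128 ** 2] * len(ate_runs)  # per-run variance implied by the reported CI
pooled, total_var, ci = pool_rubin(ate_runs, within)
```

In this made-up example the between-run variance dominates, so the pooled interval is several times wider than any single-run interval. That is the scenario the referee's comment is probing.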

Circularity Check

0 steps flagged

No significant circularity; evaluations use external benchmarks with known counterfactuals and RCT consistency checks

The derivation chain relies on DAG-constrained normalizing flows for exact inference and an LLM evolutionary imputer, with performance assessed via ablations on synthetic/semi-synthetic benchmarks that supply independent ground-truth counterfactuals. Real-world ATE estimates are cross-checked against published RCT results rather than being recovered from parameters fitted within the same dataset. No step equates a claimed prediction to its own fitted inputs by construction, nor does any load-bearing premise reduce to a self-citation whose validity is presupposed by the present work. Minor self-citations, if present, do not carry the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the assumption that causal relationships form a DAG and that LLMs can generate unbiased imputation operators for MNAR mechanisms; these are domain and paper-specific assumptions without independent verification beyond the reported benchmarks.

axioms (2)

domain assumption: Causal relationships in the data can be represented as a directed acyclic graph. Invoked to constrain the normalizing flow and separate confounding.
ad hoc to paper: Large language models can propose executable imputation operators that preserve causal metrics under MNAR missingness. Core premise of the second stage of the pipeline.

invented entities (2)

CausalFlow-T · no independent evidence
purpose: Exact invertible counterfactual inference with temporal LSTM-encoded history
New model introduced for the causal estimation stage
LLM-driven evolutionary imputer · no independent evidence
purpose: Propose imputation operators rather than individual values for MNAR data
New imputation technique proposed for the missing-data stage

pith-pipeline@v0.9.0 · 5667 in / 1575 out tokens · 47398 ms · 2026-05-08T17:01:50.396895+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1] Treatment assignment (confounding). Patients with $\mathrm{HR}_i \geq 0.5$ (i.e., higher risk) are treated with probability 0.70; the remainder with probability 0.30.

[2] Treatment effect. Treatment reduces the individual hazard ratio by 30%: $\mathrm{HR}^{\text{treated}}_i = 0.70\,\mathrm{HR}_i$.

[3] Event times. Survival times are drawn from an exponential distribution, $\mathcal{E}(1/[h_0\,\mathrm{HR}_i])$, and converted to a binary event indicator at each of the 11 timesteps. Censoring is applied only at the end of the study (all weight is on the final timestep), so within-study censoring is minimal.

[4] Counterfactuals. The smooth individual cumulative incidence function $F_i(t) = 1 - \exp(-h_0\,\mathrm{HR}_i\,t)$ is computed for both the always-treated and never-treated hazards and stored as ground-truth potential outcomes for evaluation. Purpose. Tests survival-model calibration under continuous confounding. The bimodal covariate structure creates two patient clusters. Th...
[5] Baseline demographics. Age $\sim \mathcal{N}(55, 100)$; BMI from age with Gaussian noise; total cholesterol (TC) from a log-normal with age/BMI dependence; high-density lipoprotein (HDL) inversely related to age and BMI; smoking status $\sim \mathrm{Bernoulli}(0.6)$; diabetes probability proportional to BMI; race and sex drawn from fixed population proportions.

[6] Time-varying dynamics. At each year $t$, age increments by 1; BMI, TC, and HDL evolve with Gaussian noise; hypertension (htn) transitions to 1 irreversibly with a probability driven by age, TC, and smoking; diabetes is absorbing once acquired. Systolic blood pressure (SBP) at time $t$ is: $\mathrm{SBP}(t) = 70\,\frac{\mathrm{SBP}(t-1)}{\mathrm{SBP}_0} + 0.5\,\mathrm{age}(t) + 0.15\,\mathrm{TC}(t) + 10\,\mathrm{smoker} + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, 400)$, clipped to $[80, 200]$.

[7] Treatment assignment (absorbing). At $t = 0$, patients with hypertension are treated with probability 0.70; non-hypertensive patients with probability 0.30. Treatment is absorbing: once assigned it persists. At subsequent time steps, untreated hypertensive patients initiate treatment with probability 0.01 per year. Mediated treatment effect. The antihypert...

[8] CVD outcome. Binary CVD event at each timestep from a logistic model with 12 risk-factor coefficients: $\log\frac{p}{1-p} = -10 + 0.005\,\mathrm{age} + 0.15\,\mathrm{sex} + 0.03\,\mathrm{BMI} + 0.015\,\mathrm{SBP}_{\text{final}} - 0.01\,\mathrm{HDL} + 0.01\,\mathrm{TC} + 0.25\,\mathrm{smoker} + 0.30\,\mathrm{diabetes} + 0.20\,\mathrm{fam\_hx} + 0.10\,\mathbb{1}[\text{race} = \text{Black}] - 0.05\,\mathbb{1}[\text{race} = \text{Asian}]$. The outcome is absorbing: rows after the first event are excluded. Both always-tre...
[9] A patient-month panel is constructed from selected demographic and clinical covariates.

[10] Within-patient time indices are created.

[11] Three patient-level latent traits are drawn once per patient.

[12] Two weakly autocorrelated shared state variables drive autocorrelation across series.

[13] External shocks introduce abrupt level changes.

[14] Ten variable-specific noise terms complete the stochastic structure.

[15] For each patient, a recursive simulation initializes each variable from covariates, latent traits, and noise, then iterates forward introducing autocorrelation, cross-variable dependence, treatment indicators, latent states, and shocks. Synthetic outcome. The continuous outcome is a nonlinear function of the synthetic covariates, clinical covariates, treat...
[16] Generative mechanism. The model must define a joint distribution $p(Y(0), Y(1) \mid X)$ rather than only conditional means $\mathbb{E}[Y \mid X, A]$, as the latter is insufficient for individual-level counterfactual inference via the AAP procedure [Pearl, 2009].

[17] Invertible abduction. The model must support exact or approximate recovery of patient-specific exogenous noise $z$ from observations, implementing the twin-network assumption required for individual-level counterfactuals; without this, $\hat{y}(0)$ and $\hat{y}(1)$ are population-level contrasts rather than structural counterfactuals for the same individual.

[18] Distributional evaluability. The model must produce outputs compatible with our five reliability criteria (subgroup calibration, tail variance ratio, arm reconstruction error, HR recovery, and training stability), all of which require access to the full potential outcome distributions $p(\hat{y}(a))$ rather than point predictions. Table 4 summarizes how each cand...
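Entries [1] to [4] above describe a synthetic survival benchmark with known counterfactuals, and the steps can be sketched as a small simulator. This is a hedged reconstruction, not the authors' generator: the baseline hazard `H0`, the two-cluster draw for the hazard ratio, and the unit spacing of the 11 timesteps are all assumptions, since the excerpt does not give them.

```python
import math
import random

H0 = 0.1        # assumed baseline hazard; the excerpt does not give its value
TIMESTEPS = 11  # number of discrete timesteps stated in the excerpt

def cumulative_incidence(hr, t, h0=H0):
    """Ground-truth cumulative incidence F_i(t) = 1 - exp(-h0 * HR_i * t)."""
    return 1.0 - math.exp(-h0 * hr * t)

def simulate_patient(rng):
    # Assumed two-cluster (bimodal) hazard ratio; the real covariate model is not shown.
    hr = rng.choice([0.3, 0.8]) + rng.uniform(-0.1, 0.1)
    # Confounded assignment: higher-risk patients are treated more often.
    treated = rng.random() < (0.70 if hr >= 0.5 else 0.30)
    # Treatment reduces the individual hazard ratio by 30%.
    hr_effective = 0.70 * hr if treated else hr
    # Exponential survival time with rate h0 * HR, i.e. mean 1 / (h0 * HR).
    event_time = rng.expovariate(H0 * hr_effective)
    grid = range(1, TIMESTEPS + 1)  # assumed unit spacing between timesteps
    return {
        "hr": hr,
        "treated": treated,
        "events": [int(event_time <= t) for t in grid],
        "cif_always_treated": [cumulative_incidence(0.70 * hr, t) for t in grid],
        "cif_never_treated": [cumulative_incidence(hr, t) for t in grid],
    }

rng = random.Random(0)
cohort = [simulate_patient(rng) for _ in range(2000)]
high_risk = [p for p in cohort if p["hr"] >= 0.5]
treated_share = sum(p["treated"] for p in high_risk) / len(high_risk)
```

Storing both cumulative incidence curves for every patient is what makes the benchmark useful: an estimator's counterfactual predictions can be compared against the true curve for the treatment arm that was never observed.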
