Generating Counterfactual Patient Timelines from Real-World Data
Pith reviewed 2026-05-16 11:39 UTC · model grok-4.3
The pith
An autoregressive generative model trained on real patient timelines can produce clinically plausible counterfactual trajectories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An autoregressive generative model trained on real-world data from over 300,000 patients and 400 million patient timeline entries can generate clinically plausible counterfactual trajectories. As a validation task, we applied the model to patients hospitalized with COVID-19 in 2023, modifying age, serum C-reactive protein, and serum creatinine to simulate 7-day outcomes. Increased in-hospital mortality was observed in counterfactual simulations with older age, elevated CRP, and elevated serum creatinine. Remdesivir prescriptions increased in simulations with higher CRP values and decreased in those with impaired kidney function. These counterfactual trajectories reproduced known clinical patterns.
What carries the argument
An autoregressive generative model over sequential patient timeline data. Because the model encodes event order, selected inputs can be modified and the timeline regenerated from that point to produce alternative outcome sequences.
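As a rough illustration of this mechanism, the Python sketch below edits a patient's starting attributes and then rolls a timeline forward one sampled event at a time. Everything in it is an assumption made for illustration: the four-event vocabulary, the `toy_next_event_logits` stand-in (whose coefficients are invented and which looks only at the starting attributes), and the helper names are hypothetical and are not the paper's model or tokenization. Temperature-based sampling is named in the simulated author rebuttal; the value 1.0 used here is arbitrary.

```python
import math
import random

# Hypothetical event vocabulary; the paper's real tokenization is not given here.
VOCAB = ["stable", "remdesivir", "death", "discharge"]

def toy_next_event_logits(timeline):
    """Stand-in for a trained autoregressive model: timeline prefix -> logits.
    Coefficients are invented; this toy only reads the starting attributes."""
    age = timeline[0]["age"]
    crp = timeline[0]["crp"]
    return [2.0, 0.1 * crp - 1.0, 0.03 * (age - 60) + 0.05 * crp - 3.0, 0.5]

def sample_event(logits, temperature, rng):
    """Temperature-scaled softmax sampling over the vocabulary."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(VOCAB, weights=weights, k=1)[0]

def rollout(timeline, steps, temperature, rng):
    """Append sampled events until a terminal event or `steps` events are drawn."""
    out = list(timeline)
    for _ in range(steps):
        event = sample_event(toy_next_event_logits(out), temperature, rng)
        out.append({"event": event})
        if event in ("death", "discharge"):
            break
    return out

def mortality_rate(baseline, n_samples, seed, **overrides):
    """Apply attribute edits to the baseline, then estimate mortality by Monte Carlo."""
    rng = random.Random(seed)
    edited = [{**baseline, **overrides}]
    deaths = sum(
        rollout(edited, 7, 1.0, rng)[-1].get("event") == "death"
        for _ in range(n_samples)
    )
    return deaths / n_samples

baseline = {"age": 60, "crp": 5.0}
older = mortality_rate(baseline, 5000, 0, age=85)
younger = mortality_rate(baseline, 5000, 0, age=45)
```

Here `older` comes out higher than `younger`, but only because the toy logits were written that way. The sketch shows the edit-then-rollout loop, not evidence that such edits recover causal effects.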
If this is right
- Counterfactual simulation becomes feasible for exploring personalized treatment paths without new randomized trials.
- In silico trials can be conducted by generating large numbers of hypothetical trajectories under changed clinical conditions.
- Self-supervised training on raw timelines supplies a scalable base for clinical decision support tools.
- Established associations, including higher mortality with older age or elevated CRP, appear automatically in the generated paths.
Where Pith is reading between the lines
- The same training approach could be applied to generate timelines for rarer diseases by borrowing structure from common conditions.
- Embedding the model in electronic health record systems might enable real-time what-if queries during patient encounters.
- Extending the time horizon beyond seven days would test whether the model captures longer-term dynamics.
Load-bearing premise
Modifying specific input variables in the trained model will yield counterfactual outcomes that correctly reflect the causal relationships present in the observational training data.
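Why this premise carries so much weight can be seen in a toy simulation. The sketch below assumes a hypothetical structural model in which an unobserved frailty variable `u` raises both the chance of an elevated marker `x` and the chance of death `y`; all probabilities are invented. It compares the mortality gap obtained by conditioning on `x` in observational data with the gap obtained by setting `x` directly.

```python
import random

def simulate(n, seed, do_x=None):
    """Return (x, y) pairs from a toy model with a hidden confounder u.
    If do_x is given, x is set by intervention instead of being driven by u."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.random() < 0.5                      # hidden frailty, never observed
        if do_x is None:
            x = rng.random() < (0.8 if u else 0.2)  # elevated marker, driven by u
        else:
            x = do_x                                # intervention removes the u -> x link
        p_death = 0.05 + 0.30 * u + 0.05 * x        # invented risk model
        y = rng.random() < p_death
        rows.append((x, y))
    return rows

def rate(rows, x_value):
    """Death rate among rows whose marker equals x_value."""
    ys = [y for x, y in rows if x == x_value]
    return sum(ys) / len(ys)

obs = simulate(100_000, seed=0)
observational_gap = rate(obs, True) - rate(obs, False)
interventional_gap = (
    rate(simulate(100_000, 1, do_x=True), True)
    - rate(simulate(100_000, 2, do_x=False), False)
)
```

Under these made-up numbers the observational gap is 0.23 in expectation while the interventional gap is 0.05. A model that learns only the observational distribution would reproduce the larger figure when `x` is edited, which is the failure mode the premise rules out by assumption.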
What would settle it
A large independent cohort where age, CRP, or creatinine naturally vary shows simulated 7-day mortality rates that deviate substantially from the rates actually observed in that cohort.
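One way such a comparison could be run is a per-stratum two-proportion test between simulated and observed 7-day mortality. The sketch below is a minimal version under stated assumptions: the stratum labels and all counts are placeholders rather than data from the paper, and the pooled z-test treats simulated trajectories as independent draws, which may not hold in practice.

```python
import math

def two_proportion_z(deaths_a, n_a, deaths_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled standard error).
    Assumes the pooled proportion is strictly between 0 and 1."""
    p_a, p_b = deaths_a / n_a, deaths_b / n_b
    pooled = (deaths_a + deaths_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Placeholder counts: stratum -> ((simulated deaths, n), (observed deaths, n)).
strata = {
    "age < 65": ((30, 1000), (28, 1000)),
    "age >= 65": ((120, 1000), (60, 1000)),
}

# Strata where simulated and observed mortality differ at the 5% level.
flagged = [
    name
    for name, ((d_sim, n_sim), (d_obs, n_obs)) in strata.items()
    if two_proportion_z(d_sim, n_sim, d_obs, n_obs)[1] < 0.05
]
```

With these placeholder counts only the older stratum is flagged. A real analysis would also need a correction for multiple strata and a prespecified threshold for what counts as a substantial deviation.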
original abstract
Counterfactual simulation - exploring hypothetical consequences under alternative clinical scenarios - holds promise for transformative applications such as personalized medicine and in silico trials. However, it remains challenging due to methodological limitations. Here, we show that an autoregressive generative model trained on real-world data from over 300,000 patients and 400 million patient timeline entries can generate clinically plausible counterfactual trajectories. As a validation task, we applied the model to patients hospitalized with COVID-19 in 2023, modifying age, serum C-reactive protein (CRP), and serum creatinine to simulate 7-day outcomes. Increased in-hospital mortality was observed in counterfactual simulations with older age, elevated CRP, and elevated serum creatinine. Remdesivir prescriptions increased in simulations with higher CRP values and decreased in those with impaired kidney function. These counterfactual trajectories reproduced known clinical patterns. These findings suggest that autoregressive generative models trained on real-world data in a self-supervised manner can establish a foundation for counterfactual clinical simulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that an autoregressive generative model trained self-supervised on real-world EHR data from over 300,000 patients and 400 million timeline entries can produce clinically plausible counterfactual patient trajectories. Validation consists of applying the model to 2023 COVID-19 hospitalizations, intervening on age, serum CRP, and serum creatinine, and observing that the generated 7-day outcomes reproduce established clinical patterns (increased in-hospital mortality with older age, elevated CRP, or elevated creatinine; increased remdesivir use with higher CRP and decreased use with impaired kidney function).
Significance. If the central claim holds, the work would provide a scalable, data-driven route to in silico counterfactual simulation for personalized medicine and trial design. The scale of the training corpus is a genuine strength, and the self-supervised autoregressive formulation is a natural fit for sequential EHR data. However, the reported validation only confirms reproduction of observational associations rather than recovery of interventional distributions, which limits the immediate significance for causal applications.
major comments (3)
- [Abstract] Abstract: The validation only demonstrates that the model reproduces known clinical associations when inputs are modified; this is consistent with capturing observational correlations but does not test whether the generated trajectories correspond to the outcomes that would occur under the hypothetical interventions. No ground-truth interventional data or sensitivity checks against unmeasured confounding are described.
- [Abstract] Abstract: No model architecture, training objective, sampling procedure for counterfactual generation, or quantitative evaluation metrics (e.g., calibration, AUROC for mortality, or ablation on intervention strength) are provided. Without these details it is impossible to assess whether the reported patterns arise from the claimed mechanism or from simpler memorization of marginal associations.
- [Abstract] Abstract: The manuscript reports directional changes in mortality and remdesivir prescriptions but supplies no statistical tests, confidence intervals, or error bars on the simulated outcomes, making it impossible to judge whether the observed effects exceed sampling variability in the generative process.
minor comments (1)
- [Abstract] Abstract: The phrase 'clinically plausible' is used without a precise definition or inter-rater agreement measure; a clearer operational criterion (e.g., alignment with published risk ratios) would strengthen the claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major point below and have revised the manuscript accordingly where feasible to strengthen the presentation of our work on autoregressive generative modeling for patient timelines.
point-by-point responses
- Referee: [Abstract] Abstract: The validation only demonstrates that the model reproduces known clinical associations when inputs are modified; this is consistent with capturing observational correlations but does not test whether the generated trajectories correspond to the outcomes that would occur under the hypothetical interventions. No ground-truth interventional data or sensitivity checks against unmeasured confounding are described.
  Authors: We agree that the reported validation reproduces known observational associations rather than directly recovering interventional distributions, as randomized interventional data are unavailable in this real-world EHR corpus. This reproduction of established clinical patterns (e.g., age- and biomarker-dependent mortality) serves as a necessary plausibility check for the generative process. We have added an expanded limitations paragraph explicitly discussing the reliance on observational data, the absence of ground-truth interventions, and the potential role of unmeasured confounding, while clarifying that the approach provides a scalable simulation foundation rather than definitive causal estimates.
  Revision: partial
- Referee: [Abstract] Abstract: No model architecture, training objective, sampling procedure for counterfactual generation, or quantitative evaluation metrics (e.g., calibration, AUROC for mortality, or ablation on intervention strength) are provided. Without these details it is impossible to assess whether the reported patterns arise from the claimed mechanism or from simpler memorization of marginal associations.
  Authors: These elements are described in the Methods and Experiments sections of the full manuscript, including the autoregressive transformer architecture, self-supervised next-event prediction objective, temperature-based sampling for counterfactual generation, calibration metrics, AUROC evaluations on held-out outcomes, and ablation studies varying intervention magnitude. To address the concern, we have revised the abstract to include a concise summary of the architecture and objective, and we have added explicit references to the supplementary ablation results on intervention strength.
  Revision: yes
- Referee: [Abstract] Abstract: The manuscript reports directional changes in mortality and remdesivir prescriptions but supplies no statistical tests, confidence intervals, or error bars on the simulated outcomes, making it impossible to judge whether the observed effects exceed sampling variability in the generative process.
  Authors: We acknowledge the lack of uncertainty quantification in the original submission. In the revised manuscript we have added bootstrap-derived 95% confidence intervals and error bars to all reported outcome proportions, along with two-sided statistical tests comparing each counterfactual scenario against the unmodified baseline, with p-values provided in the figure captions and text.
  Revision: yes
- Absence of ground-truth interventional data, which precludes direct empirical validation of causal recovery beyond observational pattern reproduction.
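The bootstrap-derived 95% confidence intervals mentioned in the authors' last response can be sketched with a percentile bootstrap. This is a generic illustration, not the authors' code: the function name, the number of resamples, and the example outcome vector are assumptions.

```python
import random

def bootstrap_ci(outcomes, n_boot=1000, alpha=0.05, seed=0):
    """Approximate percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement n_boot times and record each resample's proportion.
    props = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lower = props[int(alpha / 2 * n_boot)]
    upper = props[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Placeholder outcomes: 80 simulated deaths among 1000 generated trajectories.
simulated = [1] * 80 + [0] * 920
point = sum(simulated) / len(simulated)
lower, upper = bootstrap_ci(simulated)
```

An interval of this kind reflects sampling variability in the generated trajectories only. It says nothing about bias from unmeasured confounding, which is the objection left open above.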
Circularity Check
No circularity: training and validation rely on external data and independent clinical knowledge
full rationale
The paper trains an autoregressive generative model self-supervised on an external real-world dataset of over 300,000 patients and 400 million timeline entries. Counterfactual trajectories are produced by intervening on input variables (age, CRP, creatinine) in the trained model. Validation checks reproduction of independently established clinical patterns for COVID-19 mortality and remdesivir use. No equations or steps reduce by construction to the model's own outputs, no parameters are fitted then relabeled as predictions, and no load-bearing self-citations or uniqueness theorems are invoked. The chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- model architecture and training hyperparameters
axioms (1)
- domain assumption: Observational patient timeline data contain sufficient information to simulate counterfactual outcomes via input modification
Reference graph
Works this paper leans on
- [1] However, it remains challenging due to methodological limitations
  Generating Counterfactual Patient Timelines from Real-World Data Yu Akagi, M.D.1, Tomohisa Seki, M.D., Ph.D.2, Toru Takiguchi, M.D., Ph.D.2, Hiromasa Ito, M.D., Ph.D.2, Yoshimasa Kawazoe, M.D., Ph.D.2,3, Kazuhiko Ohe, M.D., Ph.D.1,2 1 Department of Biomedical Informatics, Graduate School of Medicine, The University of Tokyo, Japan 2 Department of Healthcar...
- [2] To test counterfactual reasoning, we modified the key input features: age, serum C-reactive protein (CRP), and serum creatinine. We then examined how simulated outcomes changed in response to these modifications. The resulting trajectories aligned with established clinical knowledge, such as increased mortality with increased serum CRP and reduced remdesi...
- [3] To enable unbiased evaluation of future data, only clinical records from January 2011 to December 2022 were used for model training. Records from 2023 were held out and exclusively used for counterfactual simulation. Counterfactual simulations of the COVID-19 admission cohort To assess the model’s capacity for counterfactual reasoning, we conducted simula...
- [4] We selected COVID-19 as the disease model due to its high prevalence and substantial mortality. It also spans diverse clinical contexts and is supported by extensive evidence generated during the global pandemic.13 We targeted three patient attributes for modification: age, serum C-reactive protein (CRP), and serum creatinine. These variables were chosen...
- [5] For general model evaluation purposes, we reserved the most recent 7% of patients (ordered by their first visit dates) as a holdout test set. The remaining patients were randomly split into training and validation sets, comprising 340,659 and 7,496 patients, respectively. The median age in the training set was 50 years, and 52.7% of patients were female. ...
- [6] Lee SI, Topol EJ. The clinical potential of counterfactual AI models. The Lancet. 2024 Feb;403(10428):717
- [7] Prosperi M, Guo Y, Sperrin M, Koopman JS, Min JS, He X, et al. Causal inference and counterfactual prediction in machine learning for actionable healthcare. Nat Mach Intell. 2020 Jul 13;2(7):369–75
- [8] Creemers JHA, Ankan A, Roes KCB, Schröder G, Mehra N, Figdor CG, et al. In silico cancer immunotherapy trials uncover the consequences of therapy-specific response patterns for clinical trial design and outcome. Nat Commun. 2023 Apr 24;14(1):2348
- [9] Katsoulakis E, Wang Q, Wu H, Shahriyari L, Fletcher R, Liu J, et al. Digital twins for health: a scoping review. npj Digit Med. 2024 Mar 22;7(1):77
- [10] Kraljevic Z, Bean D, Shek A, Bendayan R, Hemingway H, Yeung JA, et al. Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. The Lancet Digital Health. 2024 Apr 1;6(4):e281–90
- [11] Renc P, Jia Y, Samir AE, Was J, Li Q, Bates DW, et al. Zero shot health trajectory prediction using transformer. npj Digit Med. 2024 Sep 19;7(1):1–10
- [12] Available from: http://arxiv.org/abs/1706.03762
- [13] Available from: http://arxiv.org/abs/2005.14165
- [14] Available from: http://arxiv.org/abs/1606.08415
- [15] Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. Technical report, OpenAI; 2019 [cited 2025 Mar 25]. Available from: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- [16] Available from: http://arxiv.org/abs/1711.05101
- [17] Bhimraj A, Morgan RL, Shumaker AH, Baden LR, Cheng VCC, Edwards KM, et al. Infectious Diseases Society of America Guidelines on the Treatment and Management of Patients With COVID-19 (September 2022). Clinical Infectious Diseases. 2024 Jun 27;78(7):e250–349
- [18] Pope R, Douglas S, Chowdhery A, Devlin J, Bradbury J, Levskaya A, et al. Efficiently Scaling Transformer Inference [Internet]. arXiv; 2022 [cited 2025 Mar 26]. Available from: http://arxiv.org/abs/2211.05102
- [19] Li F, He M, Zhou M, Lai Y, Zhu Y, Liu Z, et al. Association of C-reactive protein with mortality in Covid-19 patients: a secondary analysis of a cohort study. Sci Rep. 2023 Nov 21;13(1):20361
- [20] Cheng Y, Luo R, Wang K, Zhang M, Wang Z, Dong L, et al. Kidney disease is associated with in-hospital death of patients with COVID-19. Kidney International. 2020 May;97(5):829–38
- [21] Mahalingasivam V, Su G, Iwagami M, Davids MR, Wetmore JB, Nitsch D. COVID-19 and kidney disease: insights from epidemiology to inform clinical practice. Nat Rev Nephrol. 2022 Aug;18(8):485–98
- [22] Collins GS, Dhiman P, Ma J, Schlussel MM, Archer L, Van Calster B, et al. Evaluation of clinical prediction models (part 1): from development to external validation. BMJ. 2024 Jan 8;e074819
- [23] Desai RJ, Glynn RJ, Solomon SD, Claggett B, Wang SV, Vaduganathan M. Individualized Treatment Effect Prediction with Machine Learning — Salient Considerations. NEJM Evidence [Internet]. 2024 Mar 26 [cited 2024 May 24];3(4). Available from: https://evidence.nejm.org/doi/10.1056/EVIDoa2300041
- [24] Subbiah V. The next generation of evidence-based medicine. Nat Med. 2023 Jan;29(1):49–58
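The patient-level split described in excerpt [5] (most recent 7% of patients by first visit date held out, remainder split at random) can be sketched as below. This is an assumed reading of that excerpt, not the authors' code: the function name, the `val_fraction` default, the seed, and the synthetic visit dates are all hypothetical.

```python
import random
from datetime import date

def split_patients(first_visits, test_fraction=0.07, val_fraction=0.02, seed=0):
    """Split patient ids into (train, val, test).
    first_visits maps patient id -> date of first visit."""
    # Order patients by first visit so the newest ones form the holdout test set.
    ordered = sorted(first_visits, key=lambda pid: first_visits[pid])
    n_test = round(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    rest, test = ordered[:cut], ordered[cut:]
    # Randomly divide the remaining patients into training and validation sets.
    rng = random.Random(seed)
    rng.shuffle(rest)
    n_val = round(len(rest) * val_fraction)
    return rest[n_val:], rest[:n_val], test

# Synthetic example: 100 patients with first visits on consecutive days.
start = date(2011, 1, 1).toordinal()
visits = {f"p{i}": date.fromordinal(start + i) for i in range(100)}
train, val, test = split_patients(visits)
```

Ordering by first visit before cutting keeps every test patient later in time than every training or validation patient, which is the property a temporal holdout is meant to provide.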