Simulating clinical interventions with a generative multimodal model of human physiology
Pith reviewed 2026-05-07 07:39 UTC · model grok-4.3
The pith
A generative transformer models human health trajectories to predict disease risks and simulate interventions without task-specific training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HealthFormer is a decoder-only transformer trained to generate future states of individual physiological trajectories across 667 measurements in seven domains: blood biomarkers, body composition, sleep physiology, continuous glucose monitoring, gut microbiome, wearable-derived physiology, and behavior and medication exposure. Trained on the Human Phenotype Project cohort with this single generative objective, the model transfers to four external cohorts without task-specific training and improves risk prediction over established clinical risk scores on 27 of 30 endpoints. It also supports in-silico intervention simulation: the predicted direction of effect agrees with all 41 published trial comparisons, and the predicted mean falls within the reported 95% confidence interval in 30 of them.
What carries the argument
HealthFormer, the decoder-only transformer that generates tokenized multimodal physiological trajectories and supports conditioning on intervention tokens to produce simulated future states.
Load-bearing premise
That conditioning on intervention tokens yields accurate causal simulations of health changes rather than replaying correlations from the observational training data.
What would settle it
A new randomized controlled trial in which the model's predicted mean effect for the primary outcome falls outside the trial's 95% confidence interval for more than half of the measured endpoints.
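The criterion above can be made concrete with a small check. What follows is a minimal sketch, not anything taken from the paper: the `Endpoint` type, the `is_refuted` helper, and every number in `example` are hypothetical and serve only to illustrate the "more than half of endpoints outside the 95% CI" rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """One trial endpoint: a model prediction plus the trial's reported 95% CI."""
    name: str
    predicted_mean: float  # model's predicted mean effect (hypothetical value)
    ci_low: float          # lower bound of the trial's 95% CI
    ci_high: float         # upper bound of the trial's 95% CI


def outside_ci(e: Endpoint) -> bool:
    """True when the predicted mean misses the reported interval."""
    return not (e.ci_low <= e.predicted_mean <= e.ci_high)


def is_refuted(endpoints: list[Endpoint]) -> bool:
    """Apply the 'more than half of endpoints outside the CI' criterion."""
    misses = sum(outside_ci(e) for e in endpoints)
    return misses > len(endpoints) / 2


# Made-up numbers for illustration only; not taken from the paper.
example = [
    Endpoint("hba1c_change", -0.20, -0.35, -0.05),  # inside the interval
    Endpoint("ldl_change", -4.0, -20.0, -8.0),      # outside the interval
    Endpoint("dbp_change", -1.0, -5.0, -2.0),       # outside the interval
]

print(is_refuted(example))
```

As a rough analogue on the pooled published comparisons, 30 of 41 predicted means inside the reported interval leaves 11 misses, which is fewer than half.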
Original abstract
Understanding how human health changes over time, and why responses to interventions vary between individuals, remains a central challenge in medicine. Here we present HealthFormer, a decoder-only transformer that models the human physiological trajectory generatively, by training on data from the Human Phenotype Project, a multi-visit cohort of over 15,000 deeply phenotyped individuals. We tokenise each participant's health trajectory across 667 measurements spanning seven domains: blood biomarkers, body composition, sleep physiology, continuous glucose monitoring, gut microbiome, wearable-derived physiology, and behaviour and medication exposure. We train HealthFormer to forecast individual physiological trajectories across these domains, and from this single generative objective a range of clinically relevant tasks can be expressed as queries on the model. We show that, without task-specific training, HealthFormer transfers to four independent cohorts and improves prediction for 27 of 30 incident-disease and mortality endpoints, exceeding established clinical risk scores in every comparison. We further show that the model can simulate interventions in silico: in a held-out personalised-nutrition trial, intervention-conditioned predictions recover individual six-month biomarker changes (e.g., Pearson r = 0.78 for diastolic blood pressure). Across 41 randomised intervention-outcome comparisons drawn from published trials, our results show that the predicted direction of effect agrees in every case, and the predicted mean falls within the reported 95% confidence interval in 30 cases. We position HealthFormer as an initial health world model, from which forecasting, risk stratification, and intervention-conditioned simulation arise as queries, providing a basis for clinical digital twins.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HealthFormer, a decoder-only transformer trained on tokenized multimodal physiological trajectories from the Human Phenotype Project (over 15,000 individuals, 667 measurements across seven domains). From a single next-token prediction objective, the model is shown to transfer zero-shot to four external cohorts, improving prediction on 27 of 30 incident-disease and mortality endpoints over established clinical risk scores. It is further used to simulate interventions by conditioning on intervention tokens, recovering individual biomarker changes in a held-out nutrition trial (e.g., r=0.78 for diastolic blood pressure) and agreeing with the direction of effect in all 41 published RCT comparisons, with the predicted mean inside the reported 95% CI in 30 cases.
Significance. If the intervention-conditioning results reflect causal structure rather than observational correlations, the work would constitute a substantial advance toward generative health world models and in-silico clinical trials. The single-model multi-task capability, the scale of the multimodal training data, and the direct empirical comparisons to independent cohorts and published RCTs are clear strengths that provide falsifiable predictions. The approach offers a unified framework from which forecasting, risk stratification, and intervention simulation emerge as queries on the same generative model.
major comments (3)
- [in-silico intervention results (abstract and Results)] The headline claim that intervention-conditioned generation produces clinically meaningful simulations rests on the assumption that prepending or inserting intervention tokens yields counterfactual trajectories. All training trajectories are observational; the manuscript provides no identification argument, back-door adjustment, or ablation (e.g., randomizing the intervention token independently of preceding state) to distinguish causal capture from replay of marginal associations. This is load-bearing for the 41-RCT agreement results reported in the abstract and the in-silico intervention section.
- [held-out nutrition trial evaluation (Results)] In the held-out personalized-nutrition trial evaluation, the reported Pearson r = 0.78 for diastolic blood pressure is presented as evidence of accurate individual-level simulation. The manuscript does not detail the exact procedure for inserting the intervention token relative to each participant's baseline state or whether any trial participants' data could have appeared in similar observational contexts during training, leaving open the possibility of leakage or selection effects.
- [external cohort transfer results (Results)] The transfer results claim improvement on 27 of 30 endpoints over clinical risk scores. The manuscript should report the precise statistical tests, confidence intervals, and multiple-testing correction used for these comparisons, as the central claim of superior generalization depends on demonstrating that the observed improvements are not attributable to chance or post-hoc cohort selection.
minor comments (3)
- [Abstract] The abstract states 'seven domains' but then lists blood biomarkers, body composition, sleep physiology, continuous glucose monitoring, gut microbiome, wearable-derived physiology, and behaviour and medication exposure. Clarifying whether behaviour/medication constitutes one or two domains would remove ambiguity.
- [Methods (tokenization)] The tokenization and discretization rules for the 667 continuous measurements are described at a high level; providing the exact binning thresholds or vocabulary construction procedure in the Methods would aid reproducibility.
- [Figures in Results] Several figures comparing model predictions to trial outcomes would benefit from explicit annotation of the 95% confidence intervals from the original trials alongside the model's predicted means.
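The reproducibility gap raised in the tokenization comment can be illustrated with one possible discretization scheme. The sketch below assumes quantile binning, which the paper is not confirmed to use; the function names, the `name_binK` token format, and the sample values are all hypothetical.

```python
import bisect
import statistics


def fit_bin_edges(values: list[float], n_bins: int) -> list[float]:
    """Return the n_bins - 1 interior quantile cut points of the training values."""
    return statistics.quantiles(values, n=n_bins)


def tokenize(name: str, value: float, edges: list[float]) -> str:
    """Map a continuous value to a discrete token such as 'glucose_bin2'."""
    bin_index = bisect.bisect_right(edges, value)  # ranges over 0 .. len(edges)
    return f"{name}_bin{bin_index}"


# Made-up training values for illustration only.
training_values = [float(v) for v in range(70, 131)]
edges = fit_bin_edges(training_values, n_bins=4)
print(tokenize("glucose", 95.0, edges))
```

Publishing the fitted cut points per measurement (here, `edges`) is the kind of detail that would let others rebuild the same vocabulary.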
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We appreciate the emphasis on clarifying the observational basis of our intervention simulations and on providing rigorous statistical details for the transfer results. We respond to each major comment below and indicate the revisions we will make.
Point-by-point responses
Referee: [in-silico intervention results (abstract and Results)] The headline claim that intervention-conditioned generation produces clinically meaningful simulations rests on the assumption that prepending or inserting intervention tokens yields counterfactual trajectories. All training trajectories are observational; the manuscript provides no identification argument, back-door adjustment, or ablation (e.g., randomizing the intervention token independently of preceding state) to distinguish causal capture from replay of marginal associations. This is load-bearing for the 41-RCT agreement results reported in the abstract and the in-silico intervention section.
Authors: We agree that the training data consists solely of observational trajectories and that the model therefore learns conditional associations rather than performing explicit causal inference. We cannot supply a formal identification argument or back-door adjustment because the dataset contains no randomized interventions. The intervention tokens encode observed events (e.g., medication initiation or dietary change) that occurred in the training trajectories, and the generated outputs reflect the conditional distributions the model has learned following those tokens. The reported agreement with independent RCT results is offered as empirical evidence of practical utility, not as a demonstration of causal identification. In the revision we will (i) replace language such as “simulate interventions” with “generate trajectories conditioned on intervention tokens,” (ii) add an explicit limitations paragraph stating the observational nature of the training data and the distinction from causal models, and (iii) include a new ablation in which intervention tokens are randomly reassigned; the resulting loss of alignment with RCT outcomes will be reported to show that the model is sensitive to the specific token context rather than marginal associations alone.
revision: partial
Referee: [held-out nutrition trial evaluation (Results)] In the held-out personalised-nutrition trial evaluation, the reported Pearson r = 0.78 for diastolic blood pressure is presented as evidence of accurate individual-level simulation. The manuscript does not detail the exact procedure for inserting the intervention token relative to each participant's baseline state or whether any trial participants' data could have appeared in similar observational contexts during training, leaving open the possibility of leakage or selection effects.
Authors: We thank the referee for noting this omission. In the revised manuscript we will add a dedicated methods subsection that specifies the token-insertion protocol: for each held-out trial participant the intervention token is placed immediately after the baseline measurement tokens, after which the model autoregressively generates the subsequent six-month trajectory. We confirm that no trial participants were present in the training set and that the specific intervention contexts (personalized nutrition assignments) do not appear in the observational sequences used for training. We will also report the baseline biomarker distributions of the trial cohort relative to the training population to allow readers to assess potential selection effects.
revision: yes
Referee: [external cohort transfer results (Results)] The transfer results claim improvement on 27 of 30 endpoints over clinical risk scores. The manuscript should report the precise statistical tests, confidence intervals, and multiple-testing correction used for these comparisons, as the central claim of superior generalization depends on demonstrating that the observed improvements are not attributable to chance or post-hoc cohort selection.
Authors: We agree that these statistical details are required. In the revision we will state that AUC comparisons were performed with DeLong’s test, that 95% confidence intervals were obtained by bootstrap resampling (1,000 iterations), and that p-values across the 30 endpoints were adjusted for multiple testing with the Benjamini–Hochberg procedure (FDR = 0.05). After correction, statistically significant improvements remained for 27 endpoints. The full table of test statistics, confidence intervals, and adjusted p-values will be added to the Results section and supplementary material.
revision: yes
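For readers unfamiliar with the multiple-testing step named in this response, the following is a minimal sketch of the Benjamini–Hochberg adjustment. It is a generic textbook version, not the authors' code; the `raw` p-values are invented for illustration, and DeLong's test and the bootstrap are not shown.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative raw p-values only; not results from the paper.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
# An endpoint counts as significant at FDR = 0.05 if its adjusted value is <= 0.05.
significant = [q <= 0.05 for q in adjusted]
print(adjusted, significant)
```

Reporting the adjusted values per endpoint, rather than only the count that survives, would let readers see how close each of the 30 comparisons sits to the threshold.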
- The request for an identification argument, back-door adjustment, or causal ablation demonstrating that intervention conditioning captures counterfactual effects rather than observational associations. Because the training data contains only observational trajectories, no such formal causal identification is possible within the current framework.
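One way to picture the token-reassignment ablation promised in the rebuttal is a permutation check on direction agreement. This is a sketch under stated assumptions, not the authors' protocol: the `toy_predict` function stands in for the trained model, and the token names, cases, and effect sizes are all hypothetical.

```python
import random
from typing import Callable

# A predictor maps (baseline value, intervention token) to a predicted change.
Predictor = Callable[[float, str], float]


def direction_agreement(predict: Predictor,
                        cases: list[tuple[float, float]],
                        tokens: list[str]) -> float:
    """Fraction of cases where predicted and observed changes share a sign."""
    hits = sum(
        (predict(baseline, tok) > 0) == (observed > 0)
        for (baseline, observed), tok in zip(cases, tokens)
    )
    return hits / len(cases)


def shuffled_agreement(predict: Predictor,
                       cases: list[tuple[float, float]],
                       tokens: list[str],
                       n_shuffles: int = 200,
                       seed: int = 0) -> float:
    """Mean agreement after randomly reassigning tokens across cases."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_shuffles):
        permuted = tokens[:]
        rng.shuffle(permuted)
        total += direction_agreement(predict, cases, permuted)
    return total / n_shuffles


# Hypothetical stand-in for the model: the effect depends only on the token.
TOY_EFFECTS = {"drug_a": -1.0, "drug_b": 1.0}


def toy_predict(baseline: float, token: str) -> float:
    return TOY_EFFECTS[token]


# Made-up (baseline, observed change) pairs and their true tokens.
cases = [(120.0, -0.8), (130.0, -1.2), (110.0, 0.9), (115.0, 1.1)]
tokens = ["drug_a", "drug_a", "drug_b", "drug_b"]

print(direction_agreement(toy_predict, cases, tokens))
print(shuffled_agreement(toy_predict, cases, tokens))
```

A drop in agreement under shuffling shows sensitivity to the token, but it does not by itself separate causal effect from learned association, which is the point this standing objection makes.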
Circularity Check
No significant circularity: generative model validated on independent external data
full rationale
The paper trains a decoder-only transformer on observational trajectories from the Human Phenotype Project and expresses forecasting, risk prediction, and intervention simulation as queries to the same generative model. All reported performance claims (transfer to four independent cohorts for 27/30 endpoints, agreement with 41 published RCTs on direction and 30/41 means inside 95% CI, and held-out nutrition trial r=0.78) are evaluated on data sources external to the training distribution. No equations or steps are quoted in which a prediction is defined as a fitted parameter from the same inputs, an ansatz is smuggled via self-citation, or a uniqueness result is imported from the authors' prior work. The intervention conditioning is simply next-token prediction on sequences that include intervention tokens observed in the training data; the subsequent comparison to randomized trials constitutes an independent check rather than a reduction by construction. The derivation chain is therefore self-contained against external benchmarks.
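The statement that intervention conditioning is next-token prediction over sequences containing intervention tokens can be sketched as a plain autoregressive loop. This is an illustrative toy, assuming the token placement described in the simulated rebuttal (intervention token directly after the baseline tokens); `toy_next_token` is a hand-written rule standing in for the trained model, and all token strings are hypothetical.

```python
from typing import Callable, Sequence

# A next-token function maps the context so far to the next token.
NextToken = Callable[[Sequence[str]], str]


def simulate(baseline: Sequence[str],
             intervention: str,
             next_token: NextToken,
             n_steps: int) -> list[str]:
    """Append the intervention token to the baseline, then generate n_steps tokens."""
    context = list(baseline) + [intervention]
    generated: list[str] = []
    for _ in range(n_steps):
        tok = next_token(context)
        context.append(tok)    # each generated token feeds back into the context
        generated.append(tok)
    return generated


def toy_next_token(context: Sequence[str]) -> str:
    """Hypothetical stand-in for the model: a fixed lookup rule, not a transformer."""
    if "INTERVENTION:diet" in context:
        return "glucose_bin1"
    return "glucose_bin2"


baseline = ["glucose_bin2", "dbp_bin3"]
print(simulate(baseline, "INTERVENTION:diet", toy_next_token, n_steps=3))
print(simulate(baseline, "INTERVENTION:none", toy_next_token, n_steps=3))
```

Comparing the two generated continuations is the in-silico contrast; whether that contrast is causal depends on what the model learned from observational sequences, not on the loop itself.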
Axiom & Free-Parameter Ledger
free parameters (2)
- Tokenization vocabulary and discretization rules for 667 measurements
- Transformer architecture hyperparameters (layers, heads, embedding size)
axioms (2)
- domain assumption: Human physiological trajectories can be usefully represented as sequences of tokens drawn from a fixed vocabulary across seven heterogeneous domains.
- domain assumption: Conditioning the model on intervention tokens produces forecasts that reflect the causal effect of that intervention rather than observational associations.
invented entities (1)
- HealthFormer: no independent evidence