DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 01:53 UTC · model grok-4.3
The pith
A transformer model trained on 57.1 million real-world EHR entries predicts the next disease event with a median age- and sex-stratified AUC of 0.871 across 896 disease categories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DT-Transformer, trained on 57.1M structured EHR entries over 1.7M patients from Mass General Brigham spanning 11 hospitals and outpatient clinics, achieves a median age- and sex-stratified AUC of 0.871 for next-event prediction across 896 disease categories, with every category exceeding AUC 0.5, in both held-out and prospective validation.
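The headline number is a median over 896 per-category AUCs. A minimal pure-Python sketch of that aggregation on toy data (the rank-based AUC, the five hypothetical categories, and the simulated scores are illustrative; the paper's age- and sex-stratification is not reproduced here):

```python
import random

def auc(labels, scores):
    """Rank-based (Mann-Whitney) AUC: the probability that a random
    positive case outscores a random negative one; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def median(xs):
    xs = sorted(xs)
    n = len(xs)
    return xs[n // 2] if n % 2 else 0.5 * (xs[n // 2 - 1] + xs[n // 2])

# Five hypothetical disease categories stand in for the paper's 896.
random.seed(0)
per_category_auc = []
for _ in range(5):
    labels = [random.random() < 0.3 for _ in range(200)]
    # Informative but noisy scores, so each AUC lands between 0.5 and 1.
    scores = [0.3 * y + 0.7 * random.random() for y in labels]
    per_category_auc.append(auc(labels, scores))

print(round(median(per_category_auc), 3), round(min(per_category_auc), 3))
```

Reporting the minimum alongside the median, as above, is what backs the paper's "all categories exceed 0.5" claim.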
What carries the argument
DT-Transformer, a transformer architecture that processes sequences of structured EHR entries to output the probability of the next disease event.
If this is right
- Health systems can build effective clinical forecasting tools directly from their own large-scale routine data rather than relying on curated research cohorts.
- The model maintains discrimination for all 896 tested disease categories in prospective validation on unseen patients.
- Next-event prediction at this scale supports earlier intervention and resource planning across a broad range of conditions.
- Health-system-scale training provides a practical route to foundation models for real-world clinical forecasting.
Where Pith is reading between the lines
- Similar transformer models could be retrained or fine-tuned on data from other multi-hospital systems to check whether the AUC levels transfer.
- Incorporating additional data modalities such as free-text notes or lab trends might further improve prediction for categories that currently sit near the lower end of the AUC range.
- If performance holds across systems, the same architecture could be applied to related longitudinal tasks such as medication response or complication forecasting.
Load-bearing premise
Structured EHR entries from one health system capture enough of the full complexity and variability of real-world patient trajectories for the model to generalize.
What would settle it
Applying the same model to structured EHR data from an independent health system and finding any disease category with AUC at or below 0.5 would show the claimed performance does not hold outside the training system.
Original abstract
Accurate disease trajectory prediction is critical for early intervention, resource allocation, and improving long-term outcomes. While electronic health records (EHRs) provide a rich longitudinal view of patient health in clinical environments, models trained on curated research cohorts may not reflect routine deployment settings, and those trained on single-hospital datasets capture only fragments of each patient's trajectory. This highlights the importance of leveraging large, multi-hospital health systems for training and validation to better reflect real-world clinical complexity. In this work, we develop DT-Transformer, a foundation model trained on 57.1M structured EHR entries over 1.7M patients from Mass General Brigham (MGB), spanning 11 hospitals and a broad network of outpatient clinics. DT-Transformer achieves strong discrimination in both held-out and prospective validation settings. Next-event prediction achieves a median age- and sex-stratified AUC of 0.871 across 896 disease categories, with all categories exceeding AUC 0.5. These results support health system-scale training as a path toward foundation models suited to real-world clinical forecasting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DT-Transformer, a transformer-based foundation model trained on 57.1M structured EHR entries from 1.7M patients across Mass General Brigham's 11 hospitals and outpatient network. It reports next-event prediction performance with a median age- and sex-stratified AUC of 0.871 across 896 disease categories (all >0.5) on both held-out and prospective internal validation splits, arguing that health-system-scale training advances real-world clinical forecasting.
Significance. If the reported discrimination holds under external scrutiny, the work would illustrate the feasibility of training large-scale EHR models on multi-hospital data and could inform deployment of trajectory predictors in routine care. The internal scale (57.1M entries) is a strength, but the foundation-model framing depends on evidence of transferability beyond MGB-specific patterns.
major comments (3)
- [Abstract] Abstract and Results: the headline claim that the model constitutes a path toward foundation models rests on internal MGB-only held-out and prospective splits; no external cohort, multi-center test set, or cross-system evaluation is described, which directly undermines the generalization argument for real-world deployment.
- [Results] Results: no baseline models (e.g., logistic regression using demographics plus prior codes) or ablation studies are reported, so it is impossible to determine whether the transformer architecture contributes incremental value over simpler approaches on the same 896-category task.
- [Methods] Methods: the abstract supplies no architecture details, training procedure, loss function, handling of class imbalance or missing data, or hyperparameter search; these omissions make the AUC numbers impossible to reproduce or stress-test for the central performance claim.
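The baseline the referee asks for could be sketched as below; the feature set (`[age, sex, has_prior_code]`), the simulated outcome model, and the training hyperparameters are hypothetical, purely to illustrate a demographics-plus-prior-codes logistic regression for one disease category:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=300):
    """Plain batch gradient descent on the logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

# Toy cohort: features are [age (scaled 0-1), sex, has_prior_code].
random.seed(1)
X, y = [], []
for _ in range(400):
    age, sex, prior = random.random(), random.randint(0, 1), random.randint(0, 1)
    # Simulated outcome driven mostly by the prior-code flag plus age.
    p_event = sigmoid(3.0 * prior + 2.0 * age - 2.5)
    X.append([age, sex, prior])
    y.append(1 if random.random() < p_event else 0)

w, b = train_logreg(X, y)
```

Any incremental value of the transformer would then be the AUC gap between it and this kind of model on the identical splits.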
minor comments (2)
- [Abstract] Abstract: the phrase 'strong discrimination' is used without quantifying the exact number of validation patients or the prospective time window, which would clarify the evaluation rigor.
- [Results] The manuscript should include error bars or confidence intervals on the per-category AUCs and report the distribution of AUCs rather than only the median.
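A percentile-bootstrap confidence interval of the kind the referee requests could be computed per category roughly as follows (toy labels and scores; `n_boot`, the resampling unit, and the seed are illustrative assumptions, not the paper's procedure):

```python
import random

def auc(labels, scores):
    """Rank-based AUC; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap: resample cases with replacement, recompute
    the AUC each time, take the alpha/2 and 1 - alpha/2 quantiles."""
    rng = random.Random(seed)
    n, stats = len(labels), []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if any(ys) and not all(ys):  # a resample needs both classes
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    return (stats[int(alpha / 2 * len(stats))],
            stats[int((1 - alpha / 2) * len(stats)) - 1])

# Toy labels/scores for a single disease category.
rng = random.Random(42)
labels = [rng.random() < 0.3 for _ in range(200)]
scores = [0.3 * y + 0.7 * rng.random() for y in labels]
point = auc(labels, scores)
lo, hi = bootstrap_auc_ci(labels, scores)
```

Running this per category would also yield the full AUC distribution the referee wants, rather than the median alone.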
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate the revisions planned for the manuscript.
Point-by-point responses
-
Referee: [Abstract] Abstract and Results: the headline claim that the model constitutes a path toward foundation models rests on internal MGB-only held-out and prospective splits; no external cohort, multi-center test set, or cross-system evaluation is described, which directly undermines the generalization argument for real-world deployment.
Authors: We agree that external validation on independent systems would strengthen generalization claims. The study demonstrates feasibility using large-scale multi-hospital internal data with held-out and prospective splits. In revision we will tone down foundation-model language in the abstract, add an explicit limitations paragraph noting the absence of external cohorts, and frame results as an internal health-system-scale demonstration rather than broad deployment-ready evidence. revision: partial
-
Referee: [Results] Results: no baseline models (e.g., logistic regression using demographics plus prior codes) or ablation studies are reported, so it is impossible to determine whether the transformer architecture contributes incremental value over simpler approaches on the same 896-category task.
Authors: We accept this criticism. The revised manuscript will add baseline comparisons (logistic regression on demographics plus prior codes, and a simple GRU) plus ablation studies on attention layers and embedding strategies, all evaluated on the identical 896-category task and splits. These results will be inserted into the Results section with appropriate statistical tests. revision: yes
-
Referee: [Methods] Methods: the abstract supplies no architecture details, training procedure, loss function, handling of class imbalance or missing data, or hyperparameter search; these omissions make the AUC numbers impossible to reproduce or stress-test for the central performance claim.
Authors: The full Methods section already contains these specifications (12-layer transformer, 768-dim embeddings, weighted cross-entropy loss, forward-fill plus missingness indicators, and Bayesian hyperparameter optimization). To address the abstract-level gap we will insert a concise methods summary into the abstract and add a reproducibility checklist. No new experiments are required. revision: yes
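As a hedged illustration of the weighted cross-entropy loss the rebuttal cites for handling class imbalance (the four classes, logits, and weights below are invented for the example, not taken from the paper):

```python
import math

def softmax(logits):
    m = max(logits)                               # shift for stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def weighted_cross_entropy(logits, target, class_weights):
    """-w[target] * log p[target]: upweighting rare disease categories
    counteracts class imbalance in next-event prediction."""
    probs = softmax(logits)
    return -class_weights[target] * math.log(probs[target])

# Toy 4-class example: class 3 is rare, so it carries a larger weight
# (e.g. inverse frequency). All numbers are illustrative only.
weights = [0.5, 0.8, 1.0, 4.0]
logits = [2.0, 0.5, 0.1, -1.0]
loss_common = weighted_cross_entropy(logits, 0, weights)
loss_rare = weighted_cross_entropy(logits, 3, weights)
print(round(loss_common, 3), round(loss_rare, 3))
```

Missing a rare, down-ranked class is penalized far more heavily than missing a common one, which is the point of the weighting.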
- Cannot be addressed in revision: the absence of any external validation cohort from an independent health system, which cannot be supplied from the current dataset.
Circularity Check
No significant circularity; empirical AUC from independent held-out and prospective splits
full rationale
The paper trains DT-Transformer on 57.1M MGB EHR entries and reports next-event prediction performance via median age/sex-stratified AUC of 0.871 on held-out and prospective validation sets drawn from the same source but kept separate from training. This is standard non-circular ML evaluation with no equations, self-definitional reductions, fitted-input-as-prediction, or load-bearing self-citations that collapse the reported metric to its inputs by construction. No uniqueness theorems, ansatzes, or renamings of known results are invoked in the provided text to support the core claim.
Axiom & Free-Parameter Ledger
free parameters (1)
- transformer hyperparameters
axioms (1)
- domain assumption: structured EHR entries from one health system capture representative patient trajectories
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
DT-Transformer used a Delphi-style generative disease trajectory framework... sinusoidal encoding of continuous age... cross-entropy loss for next-event prediction
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
median age- and sex-stratified AUC of 0.871 across 896 disease categories
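The first quoted passage mentions sinusoidal encoding of continuous age. One plausible reading, by analogy with transformer positional encodings (the output dimension and the `max_age` frequency base below are assumptions, not the paper's choices):

```python
import math

def sinusoidal_age_encoding(age_years, dim=8, max_age=120.0):
    """Transformer-style sinusoidal features for a continuous age:
    sin/cos pairs at geometrically spaced frequencies, so nearby ages
    map to nearby vectors while remaining distinguishable."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (max_age ** (2 * i / dim))
        enc.append(math.sin(age_years * freq))
        enc.append(math.cos(age_years * freq))
    return enc

e40 = sinusoidal_age_encoding(40.0)
e41 = sinusoidal_age_encoding(41.0)
```

Unlike binned age categories, this keeps age continuous, which matters when events a year apart should receive similar but not identical representations.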
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.