arxiv: 2511.16839 · v3 · submitted 2025-11-20 · 💻 cs.LG · cs.AI

Predicting one-year clinical instability and mortality in heart failure patients using sequence modeling

Pith reviewed 2026-05-17 20:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords heart failure · sequence modeling · electronic health records · clinical prediction · mortality · rehospitalization · risk stratification · machine learning

The pith

Sequence models on routine EHR data predict one-year clinical instability and mortality in heart failure patients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops and tests sequence models that turn structured electronic health record (EHR) data into tokenized patient sequences to forecast three one-year outcomes in heart failure patients: clinical instability after the initial in-hospital diagnosis, mortality after the initial diagnosis, and mortality after the latest hospitalization. A modular framework handles tokenization, temporal representation, and model configuration, with autoregressive next-token prediction proving most effective in short contexts (at most 512 tokens). The strongest model (Llama) reaches AUPRCs of 0.555, 0.582, and 0.854 respectively and shows robust calibration. The joint risk scores separate patients into four care pathways ranging from standard primary care to intensive home care. This matters for discharge planning because it extracts usable risk signals directly from data already collected in routine hospital care.
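The two training objectives the paper compares (next-token prediction and masked language modeling) differ mainly in how targets are built from a token sequence. A minimal sketch, assuming hypothetical token strings, a hypothetical mask rate, and a simple keep-the-most-recent-tokens truncation, none of which is taken from the paper:

```python
import random


def ntp_pairs(tokens, context_length=512):
    """Next-token prediction: targets are the inputs shifted by one position."""
    window = tokens[-context_length:]  # assumption: keep the most recent tokens
    return window[:-1], window[1:]


def mlm_pairs(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Masked language modeling: hide some tokens and predict only those."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(mask_token)
            targets.append(tok)      # loss is computed on masked positions
        else:
            inputs.append(tok)
            targets.append(None)     # no loss on visible positions
    return inputs, targets


# Hypothetical patient sequence (token names are illustrative only).
sequence = ["DX:I50", "LAB:NTproBNP:bin3", "MED:C03CA01", "VITAL:SBP:bin1", "DX:N18"]
ntp_in, ntp_out = ntp_pairs(sequence)
mlm_in, mlm_out = mlm_pairs(sequence, mask_rate=0.4)
```

Under next-token prediction every position contributes a training signal, whereas masking only supervises the hidden subset; whether that explains the reported gap in short contexts is not something this sketch can show.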

Core claim

In a Swedish cohort of 42,820 heart failure patients, autoregressive sequence models trained on tokenized EHR sequences (diagnoses, labs, medications, procedures, and vital signs) achieve AUPRCs of 0.555 for clinical instability after initial diagnosis, 0.582 for mortality after initial diagnosis, and 0.854 for mortality after latest hospitalization, with robust calibration; combining the instability and mortality predictions partitions patients into four distinct post-discharge care pathways that support individualized decisions.
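One way to picture the four-way partition is as a 2×2 grid over the two predicted risks. A minimal sketch follows; the function name, the 0.5 thresholds, and the quadrant labels are illustrative assumptions, and the paper's actual cut-offs and pathway definitions are not reproduced here.

```python
def care_quadrant(p_instability, p_mortality, t_instability=0.5, t_mortality=0.5):
    """Place a patient in one of four quadrants from two predicted risks.

    Thresholds are hypothetical; in practice they would be chosen on
    validation data and reviewed clinically.
    """
    instability = "high" if p_instability >= t_instability else "low"
    mortality = "high" if p_mortality >= t_mortality else "low"
    return instability, mortality


# Hypothetical predicted risks for four patients.
patients = {
    "A": (0.10, 0.05),
    "B": (0.70, 0.10),
    "C": (0.20, 0.80),
    "D": (0.90, 0.85),
}
quadrants = {pid: care_quadrant(*risks) for pid, risks in patients.items()}
```

Mapping each quadrant to a named pathway (the source only names the two ends, standard primary care and intensive home care) is left out on purpose, since the intermediate pathways are not described in this summary.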

What carries the argument

A modular three-component framework that converts structured EHRs into patient sequences by choosing tokenization strategies, temporal representations, and model configurations, trained primarily with autoregressive next-token prediction.
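The tokenization component can be illustrated with a small example. This is a sketch under stated assumptions: the event tuple layout, the token string format, and the helper names are hypothetical, and the only details taken from the source are that numeric values are discretized into a number of bins b and that ICD-10 codes are truncated to a hierarchy level i (per the Figure 3 caption). Temporal representation is omitted.

```python
from bisect import bisect_right
from statistics import quantiles


def fit_bins(training_values, b):
    """Return the b-1 cut points that split training values into b quantile bins."""
    return quantiles(training_values, n=b)


def tokenize_event(kind, code, value, cuts_by_code, icd_level=3):
    """Turn one structured event into a single token string."""
    if kind == "DX":
        # Truncate the ICD-10 code to its first `icd_level` characters.
        return f"DX:{code.replace('.', '')[:icd_level]}"
    if value is not None and code in cuts_by_code:
        bin_index = bisect_right(cuts_by_code[code], value)  # 0 .. b-1
        return f"{kind}:{code}:bin{bin_index}"
    return f"{kind}:{code}"


def tokenize_patient(events, cuts_by_code, icd_level=3):
    """events: list of (day, kind, code, value); returns chronologically ordered tokens."""
    ordered = sorted(events, key=lambda e: e[0])  # stable sort keeps same-day order
    return [tokenize_event(k, c, v, cuts_by_code, icd_level) for _, k, c, v in ordered]


# Hypothetical training values and patient events.
cuts = {"NTproBNP": fit_bins([100, 200, 400, 800, 1600, 3200, 6400, 12800], b=4)}
events = [
    (5, "LAB", "NTproBNP", 3000.0),
    (0, "DX", "I50.9", None),
    (5, "MED", "C03CA01", None),
]
tokens = tokenize_patient(events, cuts)
```

Raising b or icd_level enlarges the vocabulary, which is the axis along which the paper's vocabulary ablation is ordered.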

If this is right

  • Tiny Llama and Mamba configurations surpass larger conventional baselines on these tasks.
  • Strong performance persists even when clinical concepts or training data are restricted.
  • The four care pathways range from standard primary care to intensive home care.
  • Routine hospital data alone can support post-discharge risk stratification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sequence framework could be applied to outcome prediction in other chronic conditions that generate longitudinal EHR sequences.
  • Testing on multi-national or multi-center datasets would clarify whether the learned temporal risk patterns are specific to Swedish recording practices.
  • Embedding the four pathway assignments into discharge planning software could automate initial triage without new data collection.

Load-bearing premise

The Swedish single-cohort EHR sequences contain all clinically relevant temporal patterns and the chosen tokenization and temporal representations do not systematically omit key risk factors that would appear in other health systems.

What would settle it

Re-training and evaluating the identical models on an independent heart failure EHR cohort from a different country or health system and obtaining markedly lower AUPRC values on the same three tasks would show the temporal patterns do not transfer.

Figures

Figures reproduced from arXiv: 2511.16839 by Annika Rosengren, Christina E. Lundberg, Erik Aerts, Falk Dippel, Helen Sjöland, Martin Adiels, Martin Lindgren, Yinan Yu.

Figure 1: Overview of the ablation study following a simplified development process, from source electronic health records (EHRs)…
Figure 2: (a) Overview of the clinical prediction tasks in this study: Given a simplified chronologically sorted patient sequence, separate sequence models were trained to predict at discharge: one-year clinical instability at the initial HF diagnosis in-hospital (trajectory 1), one-year mortality at the initial HF diagnosis in-hospital, and one-year mortality at the time of the latest hospitalization (trajectory 2)…
Figure 3 visualizes the AUPRC performance for four different token vocabularies. The vocabularies are sorted in ascending order of unique tokens determined by the number of bins b and hierarchical ICD-10 code level i (Table B.3). Overall, Llama achieves the best discriminative performance (highest AUPRC and AUROC) and best calibration capability (lowest Brier score) across nearly all combinations, although Mamba2 i…
Figure 4: Ablation of the context length C and model size evaluated by bootstrapped AUPRC (↑) across three different clinical tasks. Within each C, sequence models are sorted by Tiny, Small, and Medium configuration, and compared to XGBoost's Default configuration. Gray background highlights common setup shared across all ablations. 4.2. Data scalability analysis: All experiments analyzing the scalability of models in…
Figure 7: Ablation of different training sizes for Medium-sized sequence models and C = 512 evaluated by bootstrapped AUPRC (↑). Gray background highlights common setup shared across all ablations. … [24]. Furthermore, models that are trained on the NTP objective achieve better results compared to models trained on MLM. Given the promising performance of Llama and Mambas compared to other models, we focus our discu…
Original abstract

Heart failure (HF) discharge planning depends on identifying patients at risk of deterioration or death, yet accurate prediction from routinely collected electronic health records (EHRs) remains challenging. We developed and validated sequence models for three one-year prediction tasks in a Swedish HF cohort (N = 42,820): clinical instability (a rehospitalization phenotype) and mortality after the initial in-hospital HF diagnosis, and mortality after the latest hospitalization. A modular three-component framework transforms structured EHRs into patient sequences by specifying tokenization strategies, temporal representations, and model configurations. Patient data included diagnoses, vital signs, laboratories, medications, and procedures. Autoregressive next-token prediction models consistently outperformed alternative objectives in short-context settings (<= 512 tokens). The best model (Llama) achieved AUPRCs (95% CI) of 0.555 (0.535-0.575), 0.582 (0.558-0.608), and 0.854 (0.842-0.865), with robust calibration. Ablations show Llama and Mamba variants learn efficient patient representations, with tiny configurations surpassing larger conventional baselines, indicating that model size alone does not improve performance. With limited clinical concepts or training data, Llama maintains strong performance, frequently surpassing full-data baselines. Combining clinical instability and mortality predictions defines four distinct care pathways, from standard primary care to intensive home care, supporting patient-centered decisions at discharge. These findings demonstrate accurate risk prediction from routine hospital data, provide actionable development guidance, and support post-discharge risk stratification.
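The abstract reports each AUPRC with a 95% confidence interval, and the figure captions refer to bootstrapped AUPRC. A minimal sketch of one common way to obtain such an interval is shown below, using step-wise average precision and a percentile bootstrap; the exact estimator, resampling scheme, and number of resamples used in the paper are not stated here, so these choices are assumptions, and the labels and scores are made up.

```python
import random


def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve (ties broken by input order)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


def bootstrap_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for average precision."""
    if sum(y_true) == 0:
        raise ValueError("need at least one positive label")
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        labels = [y_true[i] for i in idx]
        if sum(labels) == 0:
            continue  # skip resamples without any positive case
        stats.append(average_precision(labels, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Made-up labels and scores, for illustration only.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]
ap = average_precision(y_true, y_score)
ci_low, ci_high = bootstrap_ci(y_true, y_score, n_boot=200)
```

Because AUPRC depends on outcome prevalence, the three reported values are not directly comparable across tasks without also knowing each task's base rate.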

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a modular sequence modeling framework to predict one-year clinical instability (rehospitalization phenotype) and mortality after initial or latest HF hospitalization in a single Swedish EHR cohort (N=42,820). Autoregressive next-token models (best: Llama) are compared against alternatives, achieving AUPRCs (95% CI) of 0.555 (0.535-0.575), 0.582 (0.558-0.608), and 0.854 (0.842-0.865) with reported calibration; ablations examine model size, data volume, and concept count; combining the two prediction tasks is proposed to define four distinct post-discharge care pathways.

Significance. If the temporal representations prove portable, the work supplies concrete evidence that compact autoregressive models can extract actionable risk signals from routine structured EHR for heart-failure discharge planning. The data-efficiency and small-model ablations are useful practical findings. The explicit four-pathway stratification adds translational framing beyond isolated metrics.

major comments (2)
  1. [Methods] Methods (data processing and cohort description): The manuscript provides insufficient detail on patient-level data splits, temporal bucketing of visits, and imputation or masking of missing vital signs and laboratory values. These choices directly affect the reported AUPRCs and the stability of the four care pathways; without them the performance numbers cannot be confidently reproduced or stress-tested for omitted risk factors.
  2. [Results / Discussion] Results and Discussion: All performance figures and the claim that the combined predictions 'support patient-centered decisions at discharge' rest on a single Swedish registry without external or multi-center validation. Systematic differences in diagnosis granularity, vital-sign sampling frequency, or medication coding would invalidate the downstream pathway stratification; this is load-bearing for the central translational claim.
minor comments (2)
  1. [Abstract] Abstract: The three tasks are clearly stated, but a brief parenthetical reminder of what each AUPRC corresponds to (instability, mortality after index, mortality after latest) would improve immediate readability.
  2. [Figures / Tables] Figure captions: Ensure calibration plots and ablation tables explicitly label the three prediction tasks and report the exact token limits used (≤512).

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments. We address each major comment below, indicating where revisions have been made to improve the manuscript.

Point-by-point responses
  1. Referee: [Methods] Methods (data processing and cohort description): The manuscript provides insufficient detail on patient-level data splits, temporal bucketing of visits, and imputation or masking of missing vital signs and laboratory values. These choices directly affect the reported AUPRCs and the stability of the four care pathways; without them the performance numbers cannot be confidently reproduced or stress-tested for omitted risk factors.

    Authors: We agree that greater detail on these processing choices is required for reproducibility. In the revised manuscript we have expanded the Methods section with explicit descriptions of the patient-level splits (70/15/15 train/validation/test with patient-level stratification), the temporal bucketing procedure (events aggregated into fixed 30-day windows prior to tokenization), and the missing-data handling (forward-fill imputation for vital signs within a 72-hour window, mean imputation for laboratories accompanied by missingness indicator tokens, plus sensitivity analyses). These additions directly address the concerns about reproducibility and stability of the reported metrics and pathways. revision: yes

  2. Referee: [Results / Discussion] Results and Discussion: All performance figures and the claim that the combined predictions 'support patient-centered decisions at discharge' rest on a single Swedish registry without external or multi-center validation. Systematic differences in diagnosis granularity, vital-sign sampling frequency, or medication coding would invalidate the downstream pathway stratification; this is load-bearing for the central translational claim.

    Authors: We acknowledge that external validation would strengthen claims of broader applicability. The present study reports results from a single large Swedish EHR cohort and we have revised the Discussion to more explicitly state this limitation, including the potential effects of coding and sampling differences on pathway stratification. We maintain that the internal performance, calibration, and data-efficiency findings remain valid within the studied population and that the four-pathway framing constitutes a useful proof-of-concept for discharge planning in comparable settings; we do not assert generalizability beyond the cohort without further validation. revision: partial
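The processing choices named in the simulated response to the first comment (patient-level 70/15/15 splits and forward-fill of vital signs within a 72-hour window) can be sketched as follows. This illustrates the general techniques only: the rebuttal is simulated, the function names and data layout are hypothetical, and stratification, 30-day bucketing, and laboratory mean imputation are left out.

```python
import random


def split_patients(patient_ids, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split unique patient IDs so that no patient appears in two partitions."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def forward_fill(readings, max_gap_hours=72.0):
    """Carry the last observed value forward, but only within max_gap_hours.

    readings: list of (hour, value_or_None) sorted by hour.
    Returns (hour, value_or_None, was_missing) triples.
    """
    filled, last_hour, last_value = [], None, None
    for hour, value in readings:
        if value is not None:
            last_hour, last_value = hour, value
            filled.append((hour, value, False))
        elif last_value is not None and hour - last_hour <= max_gap_hours:
            filled.append((hour, last_value, True))
        else:
            filled.append((hour, None, True))  # too old to carry forward
    return filled


# Hypothetical patient IDs and systolic blood pressure readings.
train, val, test = split_patients(range(100))
sbp = forward_fill([(0, 120.0), (24, None), (100, None), (110, 118.0)])
```

Splitting by patient rather than by encounter matters here because one patient can contribute several hospitalizations, and leaking them across partitions would inflate held-out performance.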

standing simulated objections not resolved
  • External or multi-center validation of the reported AUPRCs and care-pathway stratification, which would require access to independent datasets outside the scope of the current study.

Circularity Check

0 steps flagged

Standard held-out evaluation on tokenized EHR sequences yields no circularity

Full rationale

The paper describes a modular pipeline that tokenizes structured EHR events (diagnoses, vitals, labs, meds) into sequences, trains autoregressive next-token models (Llama, Mamba variants), and evaluates three one-year prediction tasks on held-out patient sequences using AUPRC. No equations, fitted parameters, or self-citations are shown to reduce the reported AUPRC values (0.555/0.582/0.854) or the four care-pathway stratification to quantities defined by the same training inputs. Performance is obtained via conventional train-test splits on the N=42,820 Swedish cohort; the derivation chain therefore remains independent of its own fitted outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard machine-learning assumptions about sequence data and the clinical representativeness of a single national cohort; no new physical entities or ad-hoc constants are introduced beyond typical hyperparameter choices.

axioms (1)
  • domain assumption: EHR event sequences contain sufficient temporal signal to predict one-year clinical outcomes
    Invoked when converting structured records into patient sequences for autoregressive training

pith-pipeline@v0.9.0 · 5611 in / 1233 out tokens · 59397 ms · 2026-05-17T20:04:40.686002+00:00 · methodology

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 4 internal anchors

  1. [1]

    A. Groenewegen, F. H. Rutten, A. Mosterd, A. W. Hoes, Epidemiology of heart failure, European journal of heart...

    Table B.6: Ablation of patient's medical history H with context length C = 512: Truncation includes encounter information only within the latest admission (t = 0) or up to t ∈ {1, 3} years of prior H. Cutoff refers to the unprocessed H determined...

  2. [2]

    V. M. van Deursen, R. Urso, C. Laroche, K. Damman, U. Dahlström, L. Tavazzi, A. P. Maggioni, A. A. Voors, Co-morbidities in patients with heart failure: an analysis of the european heart failure pilot survey, European journal of heart failure 16 (1) (2014) 103–111. URL https://doi.org/10.1002/ejhf.30

  3. [3]

    N. Azad, G. Lemay, Management of chronic heart failure in the older population, Journal of geriatric cardiology: JGC 11 (4) (2014) 329. URL https://doi.org/10.11909/j.issn.1671-5411.2014.04.008

  4. [4]

    K. Häyrinen, K. Saranto, P. Nykänen, Definition, structure, content, use and impacts of electronic health records: a review of the research literature, International journal of medical informatics 77 (5) (2008) 291–304. URL https://doi.org/10.1016/j.ijmedinf.2007.09.001

  5. [5]

    E. Kim, S. M. Rubinstein, K. T. Nead, A. P. Wojcieszynski, P. E. Gabriel, J. L. Warner, The evolving use of electronic health records (ehr) for research, Seminars in radiation oncology 29 (4) (2019) 354–361. URL https://doi.org/10.1016/j.semradonc.2019.05.010

  6. [6]

    Y. Juhn, H. Liu, Artificial intelligence approaches using natural language processing to advance ehr-based clinical research, Journal of Allergy and Clinical Immunology 145 (2) (2020) 463–469. URL https://doi.org/10.1016/j.jaci.2019.12.897

  7. [7]

    E. Steinberg, K. Jung, J. A. Fries, C. K. Corbin, S. R. Pfohl, N. H. Shah, Language models are an effective representation learning technique for electronic health record data, Journal of biomedical informatics 113 (2021) 103637. URL https://doi.org/10.1016/j.jbi.2020.103637

  8. [8]

    M. Wornow, Y. Xu, R. Thapa, B. Patel, E. Steinberg, S. Fleming, M. A. Pfeffer, J. Fries, N. H. Shah, The shaky foundations of large language models and foundation models for electronic health records, npj digital medicine 6 (1) (2023) 135. URL https://doi.org/10.1038/s41746-023-00879-8

  9. [9]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017). URL https://dl.acm.org/doi/10.5555/3295222.3295349

  10. [10]

    S. Nerella, S. Bandyopadhyay, J. Zhang, M. Contreras, S. Siegel, A. Bumin, B. Silva, J. Sena, B. Shickel, A. Bihorac, et al., Transformers and large language models in healthcare: A review, Artificial intelligence in medicine (2024) 102900. URL https://doi.org/10.1016/j.artmed.2024.102900

  11. [11]

    K. S. Kalyan, A. Rajasekharan, S. Sangeetha, Ammu: a survey of transformer-based biomedical pretrained language models, Journal of biomedical informatics 126 (2022) 103982. URL https://doi.org/10.1016/j.jbi.2021.103982

  12. [12]

    F. Shamshad, S. Khan, S. W. Zamir, M. H. Khan, M. Hayat, F. S. Khan, H. Fu, Transformers in medical imaging: A survey, Medical image analysis 88 (2023) 102802. URL https://doi.org/10.1016/j.media.2023.102802

  13. [13]

    P. Nguyen, T. Tran, N. Wickramasinghe, S. Venkatesh, Deepr: a convolutional net for medical records, IEEE journal of biomedical and health informatics 21 (1) (2016) 22–30. URL https://doi.org/10.1109/JBHI.2016.2633963

  14. [14]

    E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, W. Stewart, Retain: An interpretable predictive model for healthcare using reverse time attention mechanism, Advances in neural information processing systems 29 (2016). URL https://dl.acm.org/doi/10.5555/3157382.3157490

  15. [15]

    Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE transactions on neural networks 5 (2) (1994) 157–166. URL https://doi.org/10.1109/72.279181

  16. [16]

    S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780. URL https://doi.org/10.1162/neco.1997.9.8.1735

  17. [17]

    S. M. Al-Selwi, M. F. Hassan, S. J. Abdulkadir, A. Muneer, E. H. Sumiea, A. Alqushaibi, M. G. Ragab, Rnn-lstm: From applications to modeling techniques and beyond—systematic review, Journal of King Saud University-Computer and Information Sciences 36 (5) (2024) 102068. URL https://doi.org/10.1016/j.jksuci.2024.102068

  18. [18]

    J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186. URL https://doi.org/10.1...

  19. [19]

    Y. Li, S. Rao, J. R. A. Solares, A. Hassaine, R. Ramakrishnan, D. Canoy, Y. Zhu, K. Rahimi, G. Salimi-Khorshidi, Behrt: transformer for electronic health records, Scientific reports 10 (1) (2020) 7155. URL https://doi.org/10.1038/s41598-020-62922-y

  20. [20]

    Y. Meng, W. Speier, M. K. Ong, C. W. Arnold, Bidirectional representation learning from transformers using multimodal electronic health record data to predict depression, IEEE journal of biomedical and health informatics 25 (8) (2021) 3121–3129. URL https://doi.org/10.1109/JBHI.2021.3063721

  21. [21]

    C. Pang, X. Jiang, K. S. Kalluri, M. Spotnitz, R. Chen, A. Perotte, K. Natarajan, Cehr-bert: Incorporating temporal information from structured ehr data to improve prediction tasks, in: Machine Learning for Health, PMLR, 2021, pp. 239–260. URL https://proceedings.mlr.press/v158/pang21a.html

  22. [22]

    Z. Yang, A. Mitra, W. Liu, D. Berlowitz, H. Yu, Transformehr: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records, Nature communications 14 (1) (2023) 7857. URL https://doi.org/10.1038/s41467-023-43715-z

  23. [23]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023). URL https://doi.org/10.48550/arXiv.2302.13971

  24. [24]

    A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, in: COLM, 2023. URL https://doi.org/10.48550/arXiv.2312.00752

  25. [25]

    A. Fallahpour, M. Alinoori, W. Ye, X. Cao, A. Afkanpour, A. Krishnan, Ehrmamba: Towards generalizable and scalable foundation models for electronic health records, in: Proceedings of the 4th Machine Learning for Health Symposium, Vol. 259 of Proceedings of Machine Learning Research, PMLR, 2025, pp. 291–307. URL https://proceedings.mlr.press/v259/fallahp...

  26. [26]

    M. Wornow, S. Bedi, M. A. F. Hernandez, E. Steinberg, J. A. Fries, C. Ré, S. Koyejo, N. H. Shah, Context clues: Evaluating long context models for clinical prediction tasks on ehrs, in: ICLR, 2025. URL https://openreview.net/forum?id=zg3ec1TdAP

  27. [27]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020). URL https://doi.org/10.48550/arXiv.2001.08361

  28. [28]

    H. Qu, L. Ning, R. An, W. Fan, T. Derr, H. Liu, X. Xu, Q. Li, A survey of mamba, arXiv preprint arXiv:2408.01129 (2024). URL https://doi.org/10.48550/arXiv.2408.01129

  29. [29]

    L. Rasmy, Y. Xiang, Z. Xie, C. Tao, D. Zhi, Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction, NPJ digital medicine 4 (1) (2021) 86. URL https://doi.org/10.1038/s41746-021-00455-y

  30. [30]

    M. Schaufelberger, S. Ekestubbe, S. Hultgren, H. Persson, A. Reimstad, M. Schaufelberger, A. Rosengren, Validity of heart failure diagnoses made in 2000–2012 in western sweden, ESC heart failure 7 (1) (2020) 37–46. URL https://doi.org/10.1002/ehf2.12519

  31. [31]

    B. Warner, A. Chaffin, B. Clavié, O. Weller, O. Hallström, S. Taghadouini, A. Gallagher, R. Biswas, F. Ladhak, T. Aarsen, G. T. Adams, J. Howard, I. Poli, Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference, in: Proceedings of the 63rd Annual Meeting of the Association for C...

  32. [32]

    T. Dao, A. Gu, Transformers are ssms: Generalized models and efficient algorithms through structured state space duality, in: ICML, 2024. URL https://dl.acm.org/doi/10.5555/3692070.3692469

  33. [33]

    M. Rupp, O. Peter, T. Pattipaka, Exbehrt: Extended transformer for electronic health records, in: International Workshop on Trustworthy Machine Learning for Healthcare, Springer, 2023, pp. 73–84. URL https://doi.org/10.1007/978-3-031-39539-0_7

  34. [34]

    E. Antikainen, J. Linnosmaa, A. Umer, N. Oksala, M. Eskola, M. van Gils, J. Hernesniemi, M. Gabbouj, Transformers for cardiac patient mortality risk prediction from heterogeneous electronic health records, Scientific Reports 13 (1) (2023) 3517. URL https://doi.org/10.1038/s41598-023-30657-1

  35. [35]

    M. Odgaard, K. V. Klein, S. M. Thysen, E. Jimenez-Solem, M. Sillesen, M. Nielsen, Core-behrt: A carefully optimized and rigorously evaluated behrt, in: Proceedings of the 9th Machine Learning for Healthcare Conference, Vol. 252 of Proceedings of Machine Learning Research, PMLR, 2024, pp. 1–33. URL https://proceedings.mlr.press/v252/odgaard24a.html

  36. [36]

    Y. Li, M. Mamouei, G. Salimi-Khorshidi, S. Rao, A. Hassaine, D. Canoy, T. Lukasiewicz, K. Rahimi, Hi-behrt: hierarchical transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records, IEEE journal of biomedical and health informatics 27 (2) (2022) 1106–1117. URL https://doi.org/10.1109/JBHI.20...

  37. [37]

    J. Shang, T. Ma, C. Xiao, J. Sun, Pre-training of graph augmented transformers for medication recommendation, in: International Joint Conference on Artificial Intelligence, 2019. URL https://doi.org/10.24963/ijcai.2019%2F825

  38. [38]

    S. Rao, M. Mamouei, G. Salimi-Khorshidi, Y. Li, R. Ramakrishnan, A. Hassaine, D. Canoy, K. Rahimi, Targeted-behrt: deep learning for observational causal inference on longitudinal electronic health records, IEEE Transactions on Neural Networks and Learning Systems 35 (4) (2022) 5027–5038. URL https://doi.org/10.1109/tnnls.2022.3183864

  39. [39]

    C. Pang, X. Jiang, N. P. Pavinkurve, K. S. Kalluri, E. L. Minto, J. Patterson, L. Zhang, G. Hripcsak, G. Gürsoy, N. Elhadad, et al., Cehr-gpt: Generating electronic health records with chronological patient timelines, arXiv preprint arXiv:2402.04400 (2024). URL https://doi.org/10.48550/arXiv.2402.04400

  40. [40]

    Z. Kraljevic, D. Bean, A. Shek, R. Bendayan, H. Hemingway, J. A. Yeung, A. Deng, A. Balston, J. Ross, E. Idowu, et al., Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study, The Lancet Digital Health 6 (4) (2024) e281–e290. URL https://doi.org/10.1016/S2589-7500...

  41. [41]

    Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, Xlnet: Generalized autoregressive pretraining for language understanding, Advances in neural information processing systems 32 (2019). URL https://dl.acm.org/doi/10.5555/3454287.3454804

  42. [42]

    R. Waleffe, W. Byeon, D. Riach, B. Norick, V. Korthikanti, T. Dao, A. Gu, A. Hatamizadeh, S. Singh, D. Narayanan, et al., An empirical study of mamba-based language models, arXiv preprint arXiv:2406.07887 (2024). URL https://doi.org/10.48550/arXiv.2406.07887

  43. [43]

    X. Liu, C. Zhang, L. Zhang, Vision mamba: A comprehensive survey and taxonomy, arXiv preprint arXiv:2405.04404 (2024). URL https://doi.org/10.48550/arXiv.2405.04404

  44. [44]

    T. Darcet, M. Oquab, J. Mairal, P. Bojanowski, Vision transformers need registers, in: ICLR, 2024. URL https://openreview.net/forum?id=2dnO3LLiJ1

  45. [45]

    T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794. URL https://doi.org/10.1145/2939672.2939785

  46. [46]

    I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: ICLR, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  47. [47]

    J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, Y. Liu, Roformer: Enhanced transformer with rotary position embedding, Neurocomputing 568 (2024) 127063. URL https://doi.org/10.1016/j.neucom.2023.127063

  48. [48]

    GLU Variants Improve Transformer

    N. Shazeer, Glu variants improve transformer, arXiv preprint arXiv:2002.05202 (2020). URL https://doi.org/10.48550/arXiv.2002.05202

  49. [49]

    R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, T. Liu, On layer normalization in the transformer architecture, in: ICML, 2020. URL https://dl.acm.org/doi/10.5555/3524938.3525913

  50. [50]

    B. Zhang, R. Sennrich, Root mean square layer normalization, Advances in neural information processing systems 32 (2019). URL https://dl.acm.org/doi/abs/10.5555/3454287.3455397

    Supplementary material. S1. Data: Table S1 highlights the extracted clinical concepts from EHRs. The eligibility criteria for clinical instability is shown in Table S2. S2. Method...