Predicting one-year clinical instability and mortality in heart failure patients using sequence modeling

Annika Rosengren; Christina E. Lundberg; Erik Aerts; Falk Dippel; Helen Sj\"oland; Martin Adiels; Martin Lindgren; Yinan Yu

arxiv: 2511.16839 · v3 · submitted 2025-11-20 · 💻 cs.LG · cs.AI

Predicting one-year clinical instability and mortality in heart failure patients using sequence modeling

Falk Dippel , Yinan Yu , Annika Rosengren , Martin Lindgren , Christina E. Lundberg , Erik Aerts , Martin Adiels , Helen Sj\"oland This is my paper

Pith reviewed 2026-05-17 20:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords heart failuresequence modelingelectronic health recordsclinical predictionmortalityrehospitalizationrisk stratificationmachine learning

0 comments

The pith

Sequence models on routine EHR data predict one-year clinical instability and mortality in heart failure patients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops and tests sequence models that turn structured electronic health record data into tokenized patient sequences to forecast three one-year outcomes after heart failure hospitalization: clinical instability, mortality after the initial diagnosis, and mortality after the latest hospitalization. A modular framework handles tokenization, temporal encoding, and model training, with autoregressive next-token prediction proving most effective in short contexts. The strongest model reaches AUPRCs of 0.555, 0.582, and 0.854 respectively, shows good calibration, and the joint risk scores separate patients into four actionable care pathways ranging from routine follow-up to intensive support. This matters for discharge planning because it extracts usable risk signals directly from data already collected in ordinary hospital care.

Core claim

In a Swedish cohort of 42,820 heart failure patients, autoregressive sequence models trained on tokenized EHR sequences (diagnoses, labs, medications, procedures, and vital signs) achieve AUPRCs of 0.555 for clinical instability after initial diagnosis, 0.582 for mortality after initial diagnosis, and 0.854 for mortality after latest hospitalization, with robust calibration; combining the instability and mortality predictions partitions patients into four distinct post-discharge care pathways that support individualized decisions.

What carries the argument

A modular three-component framework that converts structured EHRs into patient sequences by choosing tokenization strategies, temporal representations, and model configurations, trained primarily with autoregressive next-token prediction.

If this is right

Tiny Llama and Mamba configurations surpass larger conventional baselines on these tasks.
Strong performance persists even when clinical concepts or training data are restricted.
The four care pathways range from standard primary care to intensive home care.
Routine hospital data alone can support post-discharge risk stratification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sequence framework could be applied to outcome prediction in other chronic conditions that generate longitudinal EHR sequences.
Testing on multi-national or multi-center datasets would clarify whether the learned temporal risk patterns are specific to the Swedish recording practices.
Embedding the four pathway assignments into discharge planning software could automate initial triage without new data collection.

Load-bearing premise

The Swedish single-cohort EHR sequences contain all clinically relevant temporal patterns and the chosen tokenization and temporal representations do not systematically omit key risk factors that would appear in other health systems.

What would settle it

Re-training and evaluating the identical models on an independent heart failure EHR cohort from a different country or health system and obtaining markedly lower AUPRC values on the same three tasks would show the temporal patterns do not transfer.

Figures

Figures reproduced from arXiv: 2511.16839 by Annika Rosengren, Christina E. Lundberg, Erik Aerts, Falk Dippel, Helen Sj\"oland, Martin Adiels, Martin Lindgren, Yinan Yu.

**Figure 1.** Figure 1: Overview of the ablation study following a simplified development process, from source electronic health records (EHRs) [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: (a) Overview of the clinical prediction tasks in this study: Given a simplified chronologically sorted patient sequence, separate sequence models were trained to predict at discharge: one-year clinical instability at the initial HF diagnosis in-hospital (trajectory 1), one-year mortality at the initial HF diagnosis in-hospital, and one-year mortality at the time of the latest hospitalization (trajectory 2)… view at source ↗

**Figure 3.** Figure 3: visualizes the AUPRC performance for four different token vocabularies. The vocabularies are sorted in ascending order of unique tokens determined by the number of bins b and hierarchical ICD-10 code level i (Table B.3). Overall, Llama achieves the best discriminative performance (highest AUPRC and AUROC) and best calibration capability (lowest Brier score) across nearly all combinations, although Mamba2 i… view at source ↗

**Figure 4.** Figure 4: Ablation of the context length C and model size evaluated by bootstrapped AUPRC (↑) across three different clinical tasks. Within each C sequence modes are sorted by Tiny, Small, and Medium configuration, and compared to XGBoost’s Default configuration. Gray background highlights common setup shared across all ablations. 4.2. Data scalability analysis All experiments analyzing the scalability of models in… view at source ↗

**Figure 7.** Figure 7: Ablation of different training sizes for Medium-sized sequence models and C = 512 evaluated by bootstrapped AUPRC (↑). Gray background highlights common setup shared across all ablations. [24]. Furthermore, models that are trained on the NTP objective achieve better results compared to models trained on MLM. Given the promising performance of Llama and Mambas compared to other models, we focus our discu… view at source ↗

read the original abstract

Heart failure (HF) discharge planning depends on identifying patients at risk of deterioration or death, yet accurate prediction from routinely collected electronic health records (EHRs) remains challenging. We developed and validated sequence models for three one-year prediction tasks in a Swedish HF cohort (N = 42,820): clinical instability (a rehospitalization phenotype) and mortality after the initial in-hospital HF diagnosis, and mortality after the latest hospitalization. A modular three-component framework transforms structured EHRs into patient sequences by specifying tokenization strategies, temporal representations, and model configurations. Patient data included diagnoses, vital signs, laboratories, medications, and procedures. Autoregressive next-token prediction models consistently outperformed alternative objectives in short-context settings (<= 512 tokens). The best model (Llama) achieved AUPRCs (95% CI) of 0.555 (0.535-0.575), 0.582 (0.558-0.608), and 0.854 (0.842-0.865), with robust calibration. Ablations show Llama and Mamba variants learn efficient patient representations, with tiny configurations surpassing larger conventional baselines, indicating that model size alone does not improve performance. With limited clinical concepts or training data, Llama maintains strong performance, frequently surpassing full-data baselines. Combining clinical instability and mortality predictions defines four distinct care pathways, from standard primary care to intensive home care, supporting patient-centered decisions at discharge. These findings demonstrate accurate risk prediction from routine hospital data, provide actionable development guidance, and support post-discharge risk stratification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a modular sequence modeling framework to predict one-year clinical instability (rehospitalization phenotype) and mortality after initial or latest HF hospitalization in a single Swedish EHR cohort (N=42,820). Autoregressive next-token models (best: Llama) are compared against alternatives, achieving AUPRCs (95% CI) of 0.555 (0.535-0.575), 0.582 (0.558-0.608), and 0.854 (0.842-0.865) with reported calibration; ablations examine model size, data volume, and concept count; combining the two prediction tasks is proposed to define four distinct post-discharge care pathways.

Significance. If the temporal representations prove portable, the work supplies concrete evidence that compact autoregressive models can extract actionable risk signals from routine structured EHR for heart-failure discharge planning. The data-efficiency and small-model ablations are useful practical findings. The explicit four-pathway stratification adds translational framing beyond isolated metrics.

major comments (2)

[Methods] Methods (data processing and cohort description): The manuscript provides insufficient detail on patient-level data splits, temporal bucketing of visits, and imputation or masking of missing vital signs and laboratory values. These choices directly affect the reported AUPRCs and the stability of the four care pathways; without them the performance numbers cannot be confidently reproduced or stress-tested for omitted risk factors.
[Results / Discussion] Results and Discussion: All performance figures and the claim that the combined predictions 'support patient-centered decisions at discharge' rest on a single Swedish registry without external or multi-center validation. Systematic differences in diagnosis granularity, vital-sign sampling frequency, or medication coding would invalidate the downstream pathway stratification; this is load-bearing for the central translational claim.

minor comments (2)

[Abstract] Abstract: The three tasks are clearly stated, but a brief parenthetical reminder of what each AUPRC corresponds to (instability, mortality after index, mortality after latest) would improve immediate readability.
[Figures / Tables] Figure captions: Ensure calibration plots and ablation tables explicitly label the three prediction tasks and report the exact token limits used (≤512).

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments. We address each major comment below, indicating where revisions have been made to improve the manuscript.

read point-by-point responses

Referee: [Methods] Methods (data processing and cohort description): The manuscript provides insufficient detail on patient-level data splits, temporal bucketing of visits, and imputation or masking of missing vital signs and laboratory values. These choices directly affect the reported AUPRCs and the stability of the four care pathways; without them the performance numbers cannot be confidently reproduced or stress-tested for omitted risk factors.

Authors: We agree that greater detail on these processing choices is required for reproducibility. In the revised manuscript we have expanded the Methods section with explicit descriptions of the patient-level splits (70/15/15 train/validation/test with patient-level stratification), the temporal bucketing procedure (events aggregated into fixed 30-day windows prior to tokenization), and the missing-data handling (forward-fill imputation for vital signs within a 72-hour window, mean imputation for laboratories accompanied by missingness indicator tokens, plus sensitivity analyses). These additions directly address the concerns about reproducibility and stability of the reported metrics and pathways. revision: yes
Referee: [Results / Discussion] Results and Discussion: All performance figures and the claim that the combined predictions 'support patient-centered decisions at discharge' rest on a single Swedish registry without external or multi-center validation. Systematic differences in diagnosis granularity, vital-sign sampling frequency, or medication coding would invalidate the downstream pathway stratification; this is load-bearing for the central translational claim.

Authors: We acknowledge that external validation would strengthen claims of broader applicability. The present study reports results from a single large Swedish EHR cohort and we have revised the Discussion to more explicitly state this limitation, including the potential effects of coding and sampling differences on pathway stratification. We maintain that the internal performance, calibration, and data-efficiency findings remain valid within the studied population and that the four-pathway framing constitutes a useful proof-of-concept for discharge planning in comparable settings; we do not assert generalizability beyond the cohort without further validation. revision: partial

standing simulated objections not resolved

External or multi-center validation of the reported AUPRCs and care-pathway stratification, which would require access to independent datasets outside the scope of the current study.

Circularity Check

0 steps flagged

Standard held-out evaluation on tokenized EHR sequences yields no circularity

full rationale

The paper describes a modular pipeline that tokenizes structured EHR events (diagnoses, vitals, labs, meds) into sequences, trains autoregressive next-token models (Llama, Mamba variants), and evaluates three one-year prediction tasks on held-out patient sequences using AUPRC. No equations, fitted parameters, or self-citations are shown to reduce the reported AUPRC values (0.555/0.582/0.854) or the four care-pathway stratification to quantities defined by the same training inputs. Performance is obtained via conventional train-test splits on the N=42,820 Swedish cohort; the derivation chain therefore remains independent of its own fitted outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard machine-learning assumptions about sequence data and the clinical representativeness of a single national cohort; no new physical entities or ad-hoc constants are introduced beyond typical hyperparameter choices.

axioms (1)

domain assumption EHR event sequences contain sufficient temporal signal to predict one-year clinical outcomes
Invoked when converting structured records into patient sequences for autoregressive training

pith-pipeline@v0.9.0 · 5611 in / 1233 out tokens · 59397 ms · 2026-05-17T20:04:40.686002+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

A modular three-component framework transforms structured EHRs into patient sequences by specifying tokenization strategies, temporal representations, and model configurations.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 4 internal anchors

[1]

Groenewegen, F

A. Groenewegen, F. H. Rutten, A. Mosterd, A. W. Hoes, Epidemiology of heart failure, European journal of heart 12 Table B.6: Ablation of patient’s medical historyH with context length C =512: Truncation includes encounter information only within the latest admission ( t =0) or up to t∈ { 1, 3} years of prior H. Cutoffrefers to the unprocessed H determined...

work page doi:10.1002/ejhf.1858 2020
[2]

V . M. van Deursen, R. Urso, C. Laroche, K. Damman, U. Dahlström, L. Tavazzi, A. P. Maggioni, A. A. V oors, Co-morbidities in patients with heart failure: an analysis of the european heart failure pilot survey, European journal of heart failure 16 (1) (2014) 103–111. URLhttps://doi.org/10.1002/ejhf.30

work page doi:10.1002/ejhf.30 2014
[3]

N. Azad, G. Lemay, Management of chronic heart failure in the older population, Journal of geriatric cardiology: JGC 11 (4) (2014) 329. URL https://doi.org/10.11909/j.issn. 1671-5411.2014.04.008

work page doi:10.11909/j.issn 2014
[4]

Häyrinen, K

K. Häyrinen, K. Saranto, P. Nykänen, Definition, structure, content, use and impacts of electronic health records: a review of the research literature, International journal of medical informatics 77 (5) (2008) 291–304. URL https://doi.org/10.1016/j.ijmedinf.2007. 09.001

work page doi:10.1016/j.ijmedinf.2007 2008
[5]

E. Kim, S. M. Rubinstein, K. T. Nead, A. P. Wojcieszynski, P. E. Gabriel, J. L. Warner, The evolving use of electronic health records (ehr) for research, Seminars in radiation oncology 29 (4) (2019) 354–361. URL https://doi.org/10.1016/j.semradonc. 2019.05.010

work page doi:10.1016/j.semradonc 2019
[6]

Y . Juhn, H. Liu, Artificial intelligence approaches using natural language processing to advance ehr-based clini- cal research, Journal of Allergy and Clinical Immunology 145 (2) (2020) 463–469. URL https://doi.org/10.1016/j.jaci.2019.12. 897

work page doi:10.1016/j.jaci.2019.12 2020
[7]

Steinberg, K

E. Steinberg, K. Jung, J. A. Fries, C. K. Corbin, S. R. Pfohl, N. H. Shah, Language models are an effective representation learning technique for electronic health record data, Journal of biomedical informatics 113 (2021) 103637. URL https://doi.org/10.1016/j.jbi.2020. 103637

work page doi:10.1016/j.jbi.2020 2021
[8]

Wornow, Y

M. Wornow, Y . Xu, R. Thapa, B. Patel, E. Steinberg, S. Fleming, M. A. Pfeffer, J. Fries, N. H. Shah, The shaky foundations of large language models and foundation models for electronic health records, npj digital medicine 6 (1) (2023) 135. URL https://doi.org/10.1038/ s41746-023-00879-8

work page 2023
[9]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017). URL https://dl.acm.org/doi/10.5555/3295222. 3295349

work page doi:10.5555/3295222 2017
[10]

Nerella, S

S. Nerella, S. Bandyopadhyay, J. Zhang, M. Contreras, S. Siegel, A. Bumin, B. Silva, J. Sena, B. Shickel, A. Bihorac, et al., Transformers and large language models in healthcare: A review, Artificial intelligence in medicine (2024) 102900. URL https://doi.org/10.1016/j.artmed.2024. 102900

work page doi:10.1016/j.artmed.2024 2024
[11]

K. S. Kalyan, A. Rajasekharan, S. Sangeetha, Ammu: a survey of transformer-based biomedical pretrained 13 language models, Journal of biomedical informatics 126 (2022) 103982. URL https://doi.org/10.1016/j.jbi.2021. 103982

work page doi:10.1016/j.jbi.2021 2022
[12]

Shamshad, S

F. Shamshad, S. Khan, S. W. Zamir, M. H. Khan, M. Hayat, F. S. Khan, H. Fu, Transformers in medical imaging: A survey, Medical image analysis 88 (2023) 102802. URL https://doi.org/10.1016/j.media.2023. 102802

work page doi:10.1016/j.media.2023 2023
[13]

Nguyen, T

P. Nguyen, T. Tran, N. Wickramasinghe, S. Venkatesh, Deepr: a convolutional net for medical records, IEEE journal of biomedical and health informatics 21 (1) (2016) 22–30. URL https://doi.org/10.1109/JBHI.2016. 2633963

work page doi:10.1109/jbhi.2016 2016
[14]

E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, W. Stewart, Retain: An interpretable predictive model for healthcare using reverse time attention mechanism, Ad- vances in neural information processing systems 29 (2016). URL https://dl.acm.org/doi/10.5555/3157382. 3157490

work page doi:10.5555/3157382 2016
[15]

Learning long-term dependencies with gradient descent is difficult.IEEE Transactions on Neural Networks, 5(2):157–166, 1994

Y . Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE trans- actions on neural networks 5 (2) (1994) 157–166. URLhttps://doi.org/10.1109/72.279181

work page doi:10.1109/72.279181 1994
[16]

Hochreiter, J

S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780. URL https://doi.org/10.1162/neco.1997.9.8. 1735

work page doi:10.1162/neco.1997.9.8 1997
[17]

S. M. Al-Selwi, M. F. Hassan, S. J. Abdulkadir, A. Muneer, E. H. Sumiea, A. Alqushaibi, M. G. Ragab, Rnn-lstm: From applications to modeling techniques and beyond—systematic review, Journal of King Saud University-Computer and Information Sciences 36 (5) (2024) 102068. URL https://doi.org/10.1016/j.jksuci.2024. 102068

work page doi:10.1016/j.jksuci.2024 2024
[18]

Towards a unified framework for reference retrieval and related work generation

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre- training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 conference of the North American chapter of the association for compu- tational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186. URLhttps://doi.org/10.1...

work page doi:10.18653/v1 2019
[19]

Y . Li, S. Rao, J. R. A. Solares, A. Hassaine, R. Ramakrish- nan, D. Canoy, Y . Zhu, K. Rahimi, G. Salimi-Khorshidi, Behrt: transformer for electronic health records, Scientific reports 10 (1) (2020) 7155. URL https://doi.org/10.1038/ s41598-020-62922-y

work page 2020
[20]

Y . Meng, W. Speier, M. K. Ong, C. W. Arnold, Bidirec- tional representation learning from transformers using multimodal electronic health record data to predict depression, IEEE journal of biomedical and health informatics 25 (8) (2021) 3121–3129. URL https://doi.org/10.1109/JBHI.2021. 3063721

work page doi:10.1109/jbhi.2021 2021
[21]

C. Pang, X. Jiang, K. S. Kalluri, M. Spotnitz, R. Chen, A. Perotte, K. Natarajan, Cehr-bert: Incorporating temporal information from structured ehr data to improve prediction tasks, in: Machine Learning for Health, PMLR, 2021, pp. 239–260. URL https://proceedings.mlr.press/v158/ pang21a.html

work page 2021
[22]

Z. Yang, A. Mitra, W. Liu, D. Berlowitz, H. Yu, Trans- formehr: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records, Nature communications 14 (1) (2023) 7857. URL https://doi.org/10.1038/ s41467-023-43715-z

work page 2023
[23]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023). URL https://doi.org/10.48550/arXiv.2302. 13971

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302 2023
[24]

A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, in: COLM, 2023. URL https://doi.org/10.48550/arXiv.2312. 00752

work page doi:10.48550/arxiv.2312 2023
[25]

Fallahpour, M

A. Fallahpour, M. Alinoori, W. Ye, X. Cao, A. Afkanpour, A. Krishnan, Ehrmamba: Towards generalizable and scalable foundation models for electronic health records, in: Proceedings of the 4th Machine Learning for Health Symposium, V ol. 259 of Proceedings of Machine Learning Research, PMLR, 2025, pp. 291–307. URL https://proceedings.mlr.press/v259/ fallahp...

work page 2025
[26]

Wornow, S

M. Wornow, S. Bedi, M. A. F. Hernandez, E. Steinberg, J. A. Fries, C. Ré, S. Koyejo, N. H. Shah, Context clues: Evaluating long context models for clinical prediction tasks on ehrs, in: ICLR, 2025. URL https://openreview.net/forum?id= zg3ec1TdAP

work page 2025
[27]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020). URL https://doi.org/10.48550/arXiv.2001. 08361 14

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2001 2001
[28]

H. Qu, L. Ning, R. An, W. Fan, T. Derr, H. Liu, X. Xu, Q. Li, A survey of mamba, arXiv preprint arXiv:2408.01129 (2024). URL https://doi.org/10.48550/arXiv.2408. 01129

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2408 2024
[29]

Rasmy, Y

L. Rasmy, Y . Xiang, Z. Xie, C. Tao, D. Zhi, Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction, NPJ digital medicine 4 (1) (2021) 86. URL https://doi.org/10.1038/ s41746-021-00455-y

work page 2021
[30]

Schaufelberger, S

M. Schaufelberger, S. Ekestubbe, S. Hultgren, H. Persson, A. Reimstad, M. Schaufelberger, A. Rosengren, Validity of heart failure diagnoses made in 2000–2012 in western sweden, ESC heart failure 7 (1) (2020) 37–46. URLhttps://doi.org/10.1002/ehf2.12519

work page doi:10.1002/ehf2.12519 2000
[31]

In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V

B. Warner, A. Chaffin, B. Clavié, O. Weller, O. Hallström, S. Taghadouini, A. Gallagher, R. Biswas, F. Ladhak, T. Aarsen, G. T. Adams, J. Howard, I. Poli, Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference, in: Proceedings of the 63rd Annual Meeting of the Association for C...

work page doi:10.18653/v1/2025 2025
[32]

T. Dao, A. Gu, Transformers are ssms: Generalized mod- els and efficient algorithms through structured state space duality, in: ICML, 2024. URL https://dl.acm.org/doi/10.5555/3692070. 3692469

work page doi:10.5555/3692070 2024
[33]

M. Rupp, O. Peter, T. Pattipaka, Exbehrt: Extended transformer for electronic health records, in: International Workshop on Trustworthy Machine Learning for Health- care, Springer, 2023, pp. 73–84. URL https://doi.org/10.1007/ 978-3-031-39539-0_7

work page 2023
[34]

Antikainen, J

E. Antikainen, J. Linnosmaa, A. Umer, N. Oksala, M. Eskola, M. van Gils, J. Hernesniemi, M. Gabbouj, Transformers for cardiac patient mortality risk prediction from heterogeneous electronic health records, Scientific Reports 13 (1) (2023) 3517. URL https://doi.org/10.1038/ s41598-023-30657-1

work page 2023
[35]

Odgaard, K

M. Odgaard, K. V . Klein, S. M. Thysen, E. Jimenez-Solem, M. Sillesen, M. Nielsen, Core-behrt: A carefully optimized and rigorously evaluated behrt, in: Proceedings of the 9th Machine Learning for Healthcare Conference, V ol. 252 of Proceedings of Machine Learning Research, PMLR, 2024, pp. 1–33. URL https://proceedings.mlr.press/v252/ odgaard24a.html

work page 2024
[36]

Y . Li, M. Mamouei, G. Salimi-Khorshidi, S. Rao, A. Hassaine, D. Canoy, T. Lukasiewicz, K. Rahimi, Hi- behrt: hierarchical transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records, IEEE journal of biomedical and health informatics 27 (2) (2022) 1106–1117. URL https://doi.org/10.1109/JBHI.20...

work page doi:10.1109/jbhi.2022 2022
[37]

Shang, T

J. Shang, T. Ma, C. Xiao, J. Sun, Pre-training of graph aug- mented transformers for medication recommendation, in: International Joint Conference on Artificial Intelligence, 2019. URL https://doi.org/10.24963/ijcai.2019% 2F825

work page doi:10.24963/ijcai.2019 2019
[38]

S. Rao, M. Mamouei, G. Salimi-Khorshidi, Y . Li, R. Ramakrishnan, A. Hassaine, D. Canoy, K. Rahimi, Targeted-behrt: deep learning for observational causal inference on longitudinal electronic health records, IEEE Transactions on Neural Networks and Learning Systems 35 (4) (2022) 5027–5038. URL https://doi.org/10.1109/tnnls.2022. 3183864

work page doi:10.1109/tnnls.2022 2022
[39]

C. Pang, X. Jiang, N. P. Pavinkurve, K. S. Kalluri, E. L. Minto, J. Patterson, L. Zhang, G. Hripcsak, G. Gürsoy, N. Elhadad, et al., Cehr-gpt: Generating electronic health records with chronological patient timelines, arXiv preprint arXiv:2402.04400 (2024). URL https://doi.org/10.48550/arXiv.2402. 04400

work page doi:10.48550/arxiv.2402 2024
[40]

Kraljevic, D

Z. Kraljevic, D. Bean, A. Shek, R. Bendayan, H. Heming- way, J. A. Yeung, A. Deng, A. Balston, J. Ross, E. Idowu, et al., Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study, The Lancet Digital Health 6 (4) (2024) e281–e290. URL https://doi.org/10.1016/S2589-7500...

work page doi:10.1016/s2589-7500(24 2024
[41]

Z. Yang, Z. Dai, Y . Yang, J. Carbonell, R. R. Salakhutdinov, Q. V . Le, Xlnet: Generalized autoregressive pretraining for language understanding, Advances in neural information processing systems 32 (2019). URL https://dl.acm.org/doi/10.5555/3454287. 3454804

work page doi:10.5555/3454287 2019
[42]

arXiv (2024)

R. Waleffe, W. Byeon, D. Riach, B. Norick, V . Korthikanti, T. Dao, A. Gu, A. Hatamizadeh, S. Singh, D. Narayanan, et al., An empirical study of mamba-based language models, arXiv preprint arXiv:2406.07887 (2024). URL https://doi.org/10.48550/arXiv.2406. 07887

work page doi:10.48550/arxiv.2406 2024
[43]

X. Liu, C. Zhang, L. Zhang, Vision mamba: A comprehensive survey and taxonomy, arXiv preprint arXiv:2405.04404 (2024). 15 URL https://doi.org/10.48550/arXiv.2405. 04404

work page doi:10.48550/arxiv.2405 2024
[44]

Darcet, M

T. Darcet, M. Oquab, J. Mairal, P. Bojanowski, Vision transformers need registers, in: ICLR, 2024. URL https://openreview.net/forum?id= 2dnO3LLiJ1

work page 2024
[45]

T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd acm sigkdd interna- tional conference on knowledge discovery and data mining, 2016, pp. 785–794. URLhttps://doi.org/10.1145/2939672.2939785

work page doi:10.1145/2939672.2939785 2016
[46]

Loshchilov, F

I. Loshchilov, F. Hutter, Decoupled weight decay regular- ization, in: ICLR, 2019. URL https://openreview.net/forum?id= Bkg6RiCqY7

work page 2019
[47]

J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, Y . Liu, Roformer: Enhanced transformer with rotary position embedding, Neurocomputing 568 (2024) 127063. URL https://doi.org/10.1016/j.neucom.2023. 127063

work page doi:10.1016/j.neucom.2023 2024
[48]

GLU Variants Improve Transformer

N. Shazeer, Glu variants improve transformer, arXiv preprint arXiv:2002.05202 (2020). URL https://doi.org/10.48550/arXiv.2002. 05202

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2002 2002
[49]

Xiong, Y

R. Xiong, Y . Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y . Lan, L. Wang, T. Liu, On layer normalization in the transformer architecture, in: ICML, 2020. URL https://dl.acm.org/doi/10.5555/3524938. 3525913

work page doi:10.5555/3524938 2020
[50]

Zhang, R

B. Zhang, R. Sennrich, Root mean square layer normaliza- tion, Advances in neural information processing systems 32 (2019). URL https://dl.acm.org/doi/abs/10.5555/ 3454287.3455397 Supplementary material S1. Data Table S1 highlights the extracted clinical concepts from EHRs. The eligibility criteria for clinical instability is shown in Table S2. S2. Method...

work page arXiv 2019

[1] [1]

Groenewegen, F

A. Groenewegen, F. H. Rutten, A. Mosterd, A. W. Hoes, Epidemiology of heart failure, European journal of heart 12 Table B.6: Ablation of patient’s medical historyH with context length C =512: Truncation includes encounter information only within the latest admission ( t =0) or up to t∈ { 1, 3} years of prior H. Cutoffrefers to the unprocessed H determined...

work page doi:10.1002/ejhf.1858 2020

[2] [2]

V . M. van Deursen, R. Urso, C. Laroche, K. Damman, U. Dahlström, L. Tavazzi, A. P. Maggioni, A. A. V oors, Co-morbidities in patients with heart failure: an analysis of the european heart failure pilot survey, European journal of heart failure 16 (1) (2014) 103–111. URLhttps://doi.org/10.1002/ejhf.30

work page doi:10.1002/ejhf.30 2014

[3] [3]

N. Azad, G. Lemay, Management of chronic heart failure in the older population, Journal of geriatric cardiology: JGC 11 (4) (2014) 329. URL https://doi.org/10.11909/j.issn. 1671-5411.2014.04.008

work page doi:10.11909/j.issn 2014

[4] [4]

Häyrinen, K

K. Häyrinen, K. Saranto, P. Nykänen, Definition, structure, content, use and impacts of electronic health records: a review of the research literature, International journal of medical informatics 77 (5) (2008) 291–304. URL https://doi.org/10.1016/j.ijmedinf.2007. 09.001

work page doi:10.1016/j.ijmedinf.2007 2008

[5] [5]

E. Kim, S. M. Rubinstein, K. T. Nead, A. P. Wojcieszynski, P. E. Gabriel, J. L. Warner, The evolving use of electronic health records (ehr) for research, Seminars in radiation oncology 29 (4) (2019) 354–361. URL https://doi.org/10.1016/j.semradonc. 2019.05.010

work page doi:10.1016/j.semradonc 2019

[6] [6]

Y . Juhn, H. Liu, Artificial intelligence approaches using natural language processing to advance ehr-based clini- cal research, Journal of Allergy and Clinical Immunology 145 (2) (2020) 463–469. URL https://doi.org/10.1016/j.jaci.2019.12. 897

work page doi:10.1016/j.jaci.2019.12 2020

[7] [7]

Steinberg, K

E. Steinberg, K. Jung, J. A. Fries, C. K. Corbin, S. R. Pfohl, N. H. Shah, Language models are an effective representation learning technique for electronic health record data, Journal of biomedical informatics 113 (2021) 103637. URL https://doi.org/10.1016/j.jbi.2020. 103637

work page doi:10.1016/j.jbi.2020 2021

[8] [8]

Wornow, Y

M. Wornow, Y . Xu, R. Thapa, B. Patel, E. Steinberg, S. Fleming, M. A. Pfeffer, J. Fries, N. H. Shah, The shaky foundations of large language models and foundation models for electronic health records, npj digital medicine 6 (1) (2023) 135. URL https://doi.org/10.1038/ s41746-023-00879-8

work page 2023

[9] [9]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017). URL https://dl.acm.org/doi/10.5555/3295222. 3295349

work page doi:10.5555/3295222 2017

[10] [10]

Nerella, S

S. Nerella, S. Bandyopadhyay, J. Zhang, M. Contreras, S. Siegel, A. Bumin, B. Silva, J. Sena, B. Shickel, A. Bihorac, et al., Transformers and large language models in healthcare: A review, Artificial intelligence in medicine (2024) 102900. URL https://doi.org/10.1016/j.artmed.2024. 102900

work page doi:10.1016/j.artmed.2024 2024

[11] [11]

K. S. Kalyan, A. Rajasekharan, S. Sangeetha, Ammu: a survey of transformer-based biomedical pretrained 13 language models, Journal of biomedical informatics 126 (2022) 103982. URL https://doi.org/10.1016/j.jbi.2021. 103982

work page doi:10.1016/j.jbi.2021 2022

[12] [12]

Shamshad, S

F. Shamshad, S. Khan, S. W. Zamir, M. H. Khan, M. Hayat, F. S. Khan, H. Fu, Transformers in medical imaging: A survey, Medical image analysis 88 (2023) 102802. URL https://doi.org/10.1016/j.media.2023. 102802

work page doi:10.1016/j.media.2023 2023

[13] [13]

Nguyen, T

P. Nguyen, T. Tran, N. Wickramasinghe, S. Venkatesh, Deepr: a convolutional net for medical records, IEEE journal of biomedical and health informatics 21 (1) (2016) 22–30. URL https://doi.org/10.1109/JBHI.2016. 2633963

work page doi:10.1109/jbhi.2016 2016

[14] [14]

E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, W. Stewart, Retain: An interpretable predictive model for healthcare using reverse time attention mechanism, Ad- vances in neural information processing systems 29 (2016). URL https://dl.acm.org/doi/10.5555/3157382. 3157490

work page doi:10.5555/3157382 2016

[15] [15]

Learning long-term dependencies with gradient descent is difficult.IEEE Transactions on Neural Networks, 5(2):157–166, 1994

Y . Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE trans- actions on neural networks 5 (2) (1994) 157–166. URLhttps://doi.org/10.1109/72.279181

work page doi:10.1109/72.279181 1994

[16] [16]

Hochreiter, J

S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780. URL https://doi.org/10.1162/neco.1997.9.8. 1735

work page doi:10.1162/neco.1997.9.8 1997

[17] [17]

S. M. Al-Selwi, M. F. Hassan, S. J. Abdulkadir, A. Muneer, E. H. Sumiea, A. Alqushaibi, M. G. Ragab, Rnn-lstm: From applications to modeling techniques and beyond—systematic review, Journal of King Saud University-Computer and Information Sciences 36 (5) (2024) 102068. URL https://doi.org/10.1016/j.jksuci.2024. 102068

work page doi:10.1016/j.jksuci.2024 2024

[18] [18]

Towards a unified framework for reference retrieval and related work generation

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre- training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 conference of the North American chapter of the association for compu- tational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186. URLhttps://doi.org/10.1...

work page doi:10.18653/v1 2019

[19] [19]

Y . Li, S. Rao, J. R. A. Solares, A. Hassaine, R. Ramakrish- nan, D. Canoy, Y . Zhu, K. Rahimi, G. Salimi-Khorshidi, Behrt: transformer for electronic health records, Scientific reports 10 (1) (2020) 7155. URL https://doi.org/10.1038/ s41598-020-62922-y

work page 2020

[20] [20]

Y . Meng, W. Speier, M. K. Ong, C. W. Arnold, Bidirec- tional representation learning from transformers using multimodal electronic health record data to predict depression, IEEE journal of biomedical and health informatics 25 (8) (2021) 3121–3129. URL https://doi.org/10.1109/JBHI.2021. 3063721

work page doi:10.1109/jbhi.2021 2021

[21] [21]

C. Pang, X. Jiang, K. S. Kalluri, M. Spotnitz, R. Chen, A. Perotte, K. Natarajan, Cehr-bert: Incorporating temporal information from structured ehr data to improve prediction tasks, in: Machine Learning for Health, PMLR, 2021, pp. 239–260. URL https://proceedings.mlr.press/v158/ pang21a.html

work page 2021

[22] [22]

Z. Yang, A. Mitra, W. Liu, D. Berlowitz, H. Yu, Trans- formehr: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records, Nature communications 14 (1) (2023) 7857. URL https://doi.org/10.1038/ s41467-023-43715-z

work page 2023

[23] [23]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023). URL https://doi.org/10.48550/arXiv.2302. 13971

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302 2023

[24] [24]

A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, in: COLM, 2023. URL https://doi.org/10.48550/arXiv.2312. 00752

work page doi:10.48550/arxiv.2312 2023

[25] [25]

Fallahpour, M

A. Fallahpour, M. Alinoori, W. Ye, X. Cao, A. Afkanpour, A. Krishnan, Ehrmamba: Towards generalizable and scalable foundation models for electronic health records, in: Proceedings of the 4th Machine Learning for Health Symposium, V ol. 259 of Proceedings of Machine Learning Research, PMLR, 2025, pp. 291–307. URL https://proceedings.mlr.press/v259/ fallahp...

work page 2025

[26] [26]

Wornow, S

M. Wornow, S. Bedi, M. A. F. Hernandez, E. Steinberg, J. A. Fries, C. Ré, S. Koyejo, N. H. Shah, Context clues: Evaluating long context models for clinical prediction tasks on ehrs, in: ICLR, 2025. URL https://openreview.net/forum?id= zg3ec1TdAP

work page 2025

[27] [27]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020). URL https://doi.org/10.48550/arXiv.2001. 08361 14

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2001 2001

[28] [28]

H. Qu, L. Ning, R. An, W. Fan, T. Derr, H. Liu, X. Xu, Q. Li, A survey of mamba, arXiv preprint arXiv:2408.01129 (2024). URL https://doi.org/10.48550/arXiv.2408. 01129

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2408 2024

[29] [29]

Rasmy, Y

L. Rasmy, Y . Xiang, Z. Xie, C. Tao, D. Zhi, Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction, NPJ digital medicine 4 (1) (2021) 86. URL https://doi.org/10.1038/ s41746-021-00455-y

work page 2021

[30] [30]

Schaufelberger, S

M. Schaufelberger, S. Ekestubbe, S. Hultgren, H. Persson, A. Reimstad, M. Schaufelberger, A. Rosengren, Validity of heart failure diagnoses made in 2000–2012 in western sweden, ESC heart failure 7 (1) (2020) 37–46. URLhttps://doi.org/10.1002/ehf2.12519

work page doi:10.1002/ehf2.12519 2000

[31] [31]

In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V

B. Warner, A. Chaffin, B. Clavié, O. Weller, O. Hallström, S. Taghadouini, A. Gallagher, R. Biswas, F. Ladhak, T. Aarsen, G. T. Adams, J. Howard, I. Poli, Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference, in: Proceedings of the 63rd Annual Meeting of the Association for C...

work page doi:10.18653/v1/2025 2025

[32] [32]

T. Dao, A. Gu, Transformers are ssms: Generalized mod- els and efficient algorithms through structured state space duality, in: ICML, 2024. URL https://dl.acm.org/doi/10.5555/3692070. 3692469

work page doi:10.5555/3692070 2024

[33] [33]

M. Rupp, O. Peter, T. Pattipaka, Exbehrt: Extended transformer for electronic health records, in: International Workshop on Trustworthy Machine Learning for Health- care, Springer, 2023, pp. 73–84. URL https://doi.org/10.1007/ 978-3-031-39539-0_7

work page 2023

[34] [34]

Antikainen, J

E. Antikainen, J. Linnosmaa, A. Umer, N. Oksala, M. Eskola, M. van Gils, J. Hernesniemi, M. Gabbouj, Transformers for cardiac patient mortality risk prediction from heterogeneous electronic health records, Scientific Reports 13 (1) (2023) 3517. URL https://doi.org/10.1038/ s41598-023-30657-1

work page 2023

[35] [35]

Odgaard, K

M. Odgaard, K. V . Klein, S. M. Thysen, E. Jimenez-Solem, M. Sillesen, M. Nielsen, Core-behrt: A carefully optimized and rigorously evaluated behrt, in: Proceedings of the 9th Machine Learning for Healthcare Conference, V ol. 252 of Proceedings of Machine Learning Research, PMLR, 2024, pp. 1–33. URL https://proceedings.mlr.press/v252/ odgaard24a.html

work page 2024

[36] [36]

Y . Li, M. Mamouei, G. Salimi-Khorshidi, S. Rao, A. Hassaine, D. Canoy, T. Lukasiewicz, K. Rahimi, Hi- behrt: hierarchical transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records, IEEE journal of biomedical and health informatics 27 (2) (2022) 1106–1117. URL https://doi.org/10.1109/JBHI.20...

work page doi:10.1109/jbhi.2022 2022

[37] [37]

Shang, T

J. Shang, T. Ma, C. Xiao, J. Sun, Pre-training of graph aug- mented transformers for medication recommendation, in: International Joint Conference on Artificial Intelligence, 2019. URL https://doi.org/10.24963/ijcai.2019% 2F825

work page doi:10.24963/ijcai.2019 2019

[38] [38]

S. Rao, M. Mamouei, G. Salimi-Khorshidi, Y . Li, R. Ramakrishnan, A. Hassaine, D. Canoy, K. Rahimi, Targeted-behrt: deep learning for observational causal inference on longitudinal electronic health records, IEEE Transactions on Neural Networks and Learning Systems 35 (4) (2022) 5027–5038. URL https://doi.org/10.1109/tnnls.2022. 3183864

work page doi:10.1109/tnnls.2022 2022

[39] [39]

C. Pang, X. Jiang, N. P. Pavinkurve, K. S. Kalluri, E. L. Minto, J. Patterson, L. Zhang, G. Hripcsak, G. Gürsoy, N. Elhadad, et al., Cehr-gpt: Generating electronic health records with chronological patient timelines, arXiv preprint arXiv:2402.04400 (2024). URL https://doi.org/10.48550/arXiv.2402. 04400

work page doi:10.48550/arxiv.2402 2024

[40] [40]

Kraljevic, D

Z. Kraljevic, D. Bean, A. Shek, R. Bendayan, H. Heming- way, J. A. Yeung, A. Deng, A. Balston, J. Ross, E. Idowu, et al., Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study, The Lancet Digital Health 6 (4) (2024) e281–e290. URL https://doi.org/10.1016/S2589-7500...

work page doi:10.1016/s2589-7500(24 2024

[41] [41]

Z. Yang, Z. Dai, Y . Yang, J. Carbonell, R. R. Salakhutdinov, Q. V . Le, Xlnet: Generalized autoregressive pretraining for language understanding, Advances in neural information processing systems 32 (2019). URL https://dl.acm.org/doi/10.5555/3454287. 3454804

work page doi:10.5555/3454287 2019

[42] [42]

arXiv (2024)

R. Waleffe, W. Byeon, D. Riach, B. Norick, V . Korthikanti, T. Dao, A. Gu, A. Hatamizadeh, S. Singh, D. Narayanan, et al., An empirical study of mamba-based language models, arXiv preprint arXiv:2406.07887 (2024). URL https://doi.org/10.48550/arXiv.2406. 07887

work page doi:10.48550/arxiv.2406 2024

[43] [43]

X. Liu, C. Zhang, L. Zhang, Vision mamba: A comprehensive survey and taxonomy, arXiv preprint arXiv:2405.04404 (2024). 15 URL https://doi.org/10.48550/arXiv.2405. 04404

work page doi:10.48550/arxiv.2405 2024

[44] [44]

Darcet, M

T. Darcet, M. Oquab, J. Mairal, P. Bojanowski, Vision transformers need registers, in: ICLR, 2024. URL https://openreview.net/forum?id= 2dnO3LLiJ1

work page 2024

[45] [45]

T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd acm sigkdd interna- tional conference on knowledge discovery and data mining, 2016, pp. 785–794. URLhttps://doi.org/10.1145/2939672.2939785

work page doi:10.1145/2939672.2939785 2016

[46] [46]

Loshchilov, F

I. Loshchilov, F. Hutter, Decoupled weight decay regular- ization, in: ICLR, 2019. URL https://openreview.net/forum?id= Bkg6RiCqY7

work page 2019

[47] [47]

J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, Y . Liu, Roformer: Enhanced transformer with rotary position embedding, Neurocomputing 568 (2024) 127063. URL https://doi.org/10.1016/j.neucom.2023. 127063

work page doi:10.1016/j.neucom.2023 2024

[48] [48]

GLU Variants Improve Transformer

N. Shazeer, Glu variants improve transformer, arXiv preprint arXiv:2002.05202 (2020). URL https://doi.org/10.48550/arXiv.2002. 05202

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2002 2002

[49] [49]

Xiong, Y

R. Xiong, Y . Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y . Lan, L. Wang, T. Liu, On layer normalization in the transformer architecture, in: ICML, 2020. URL https://dl.acm.org/doi/10.5555/3524938. 3525913

work page doi:10.5555/3524938 2020

[50] [50]

Zhang, R

B. Zhang, R. Sennrich, Root mean square layer normaliza- tion, Advances in neural information processing systems 32 (2019). URL https://dl.acm.org/doi/abs/10.5555/ 3454287.3455397 Supplementary material S1. Data Table S1 highlights the extracted clinical concepts from EHRs. The eligibility criteria for clinical instability is shown in Table S2. S2. Method...

work page arXiv 2019