pith. machine review for the scientific record.

arxiv: 2604.16775 · v1 · submitted 2026-04-18 · 💻 cs.LG · cs.AI

Recognition: unknown

Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 07:19 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords medical event models · tokenization · representation learning · MIMIC-IV · clinical prediction · generative models · fixed-budget benchmark · code-value fusion

The pith

Fused code-value tokenization raises mortality AUROC from 0.891 to 0.915 in fixed-budget medical event models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper isolates the effects of input representation on generative medical models by training many matched transformers under a fixed one-epoch budget on MIMIC-IV data. It tests variations in quantization, value encoding, temporal encoding, and code set remapping across 30 downstream clinical outcomes. Fused code-value tokens produce the largest gains, while event-order and admission-relative RoPE encodings match time-token performance with shorter sequences. CLIF remapping maintains accuracy with a smaller, multi-site-compatible vocabulary. These results indicate that tokenization decisions set hard limits on what models can learn before any training begins.
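To ground the fusion axis, here is a minimal Python sketch of the two tokenization schemes under decile quantization; the event class, code name, and cutpoints are illustrative inventions, not the paper's implementation.

```python
# Unfused vs. fused code-value tokenization for one lab event (illustrative).
from dataclasses import dataclass

@dataclass
class LabEvent:
    code: str     # e.g., "LAB_POTASSIUM" (hypothetical code name)
    value: float  # measured value

def decile_bin(value: float, cutpoints: list[float]) -> int:
    """Map a raw value to a decile bin (0-9) via training-set cutpoints."""
    return sum(value > c for c in cutpoints)

def tokenize_unfused(ev: LabEvent, cutpoints: list[float]) -> list[str]:
    # Two tokens per event: the code, then a quantile token shared across all codes.
    return [ev.code, f"Q{decile_bin(ev.value, cutpoints)}"]

def tokenize_fused(ev: LabEvent, cutpoints: list[float]) -> list[str]:
    # One token per event: code and bin fused, so each variable gets its own
    # value vocabulary and the sequence is shorter.
    return [f"{ev.code}|Q{decile_bin(ev.value, cutpoints)}"]

cuts = [3.3, 3.5, 3.7, 3.9, 4.1, 4.3, 4.5, 4.8, 5.2]  # made-up potassium deciles
ev = LabEvent("LAB_POTASSIUM", 4.2)
print(tokenize_unfused(ev, cuts))  # ['LAB_POTASSIUM', 'Q5']
print(tokenize_fused(ev, cuts))    # ['LAB_POTASSIUM|Q5']
```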

Core claim

Training 28 matched transformers for one epoch under a shared budget on MIMIC-IV shows that fused code-value tokenization improves mortality AUROC from 0.891 to 0.915, hospital length-of-stay AUROC from 0.763 to 0.788, and mean Spearman rho across 13 regression outcomes from 0.414 to 0.494. Event-order and admission-relative RoPE temporal encodings match or exceed time-token insertion on average while shortening sequences by 11 percent. CLIF remapping preserves downstream performance in the single-site setting while yielding a smaller, clinically interpretable token set. Finer-than-decile quantization, reference-range anchoring, and soft discretization provide selective benefits, whereas code-normalized xVal remains well below the discrete and soft families.
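The temporal-encoding contrast is easy to sketch: inserting time tokens lengthens the sequence, while admission-relative RoPE keeps the token count fixed and feeds hours-since-admission into the rotary angles. Below is a toy illustration using the standard RoPE rotation; the event stream, gap binning, and token names are invented.

```python
import numpy as np

events = [("ADMIT", 0.0), ("LAB_K|Q5", 2.5), ("VITAL_HR|Q8", 2.6), ("RX_ABX", 7.0)]

# (a) Time tokens: discretized gaps become extra tokens, lengthening the sequence.
seq, prev = [], 0.0
for tok, t in events:
    if t - prev >= 1.0:
        seq.append(f"GAP_{int(t - prev)}h")
    seq.append(tok)
    prev = t
print(seq)  # ['ADMIT', 'GAP_2h', 'LAB_K|Q5', 'VITAL_HR|Q8', 'GAP_4h', 'RX_ABX']

# (b) Admission-relative RoPE: no extra tokens; rotate query/key pairs by angles
# proportional to hours since admission rather than the token's sequence index.
def rope_rotate(x: np.ndarray, t: float, theta: float = 10000.0) -> np.ndarray:
    d = x.size
    freqs = theta ** (-np.arange(0, d, 2) / d)
    ang = t * freqs
    out = x.copy()
    out[0::2] = x[0::2] * np.cos(ang) - x[1::2] * np.sin(ang)
    out[1::2] = x[0::2] * np.sin(ang) + x[1::2] * np.cos(ang)
    return out

q = np.ones(8)
print(rope_rotate(q, t=2.5))  # position is wall-clock time, not token index
```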

What carries the argument

Fixed-budget benchmark that trains matched one-epoch transformers to isolate representation choices from optimization and architectural confounds.
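A toy, runnable rendering of that protocol in PyTorch, with a one-layer model standing in for the paper's transformer: both arms share the seed, architecture width, optimizer, and a single pass over the same admissions, so only the token stream differs. It also makes the budget subtlety visible: under "one epoch", the arm with more tokens per event takes more optimizer steps.

```python
import torch
import torch.nn as nn

def one_epoch_lm(token_ids: torch.Tensor, vocab_size: int, seed: int = 0) -> int:
    torch.manual_seed(seed)  # matched RNG across arms
    model = nn.Sequential(nn.Embedding(vocab_size, 32),
                          nn.Flatten(0, 1),
                          nn.Linear(32, vocab_size))
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()
    steps = 0
    for i in range(0, token_ids.numel() - 9, 8):  # a single pass over the stream
        x = token_ids[i:i + 8].unsqueeze(0)       # context window
        y = token_ids[i + 1:i + 9]                # next-token targets
        loss = loss_fn(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
        steps += 1
    return steps

fused = torch.randint(0, 200, (400,))   # fused arm: 1 token/event, larger vocab
unfused = torch.randint(0, 60, (800,))  # unfused arm: 2 tokens/event, smaller vocab
print(one_epoch_lm(fused, 200), one_epoch_lm(unfused, 60))  # 49 vs. 99 steps
```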

If this is right

  • Fused code-value tokenization improves mortality and length-of-stay AUROC and regression Spearman rho compared with separate encoding.
  • Event-order only and admission-relative RoPE temporal encodings achieve comparable or higher average performance than time tokens while cutting sequence length by 11%.
  • CLIF remapping preserves task performance while producing a smaller, clinically interpretable token vocabulary suitable for multi-site use.
  • Finer-than-decile quantization, reference-range anchoring, and soft discretization improve selected outcomes, but code-normalized xVal lags the discrete and soft families.
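A small NumPy sketch of the three value-encoding families in the last bullet, with random matrices standing in for learned embeddings; the bin centers, temperature, and normalization constants are invented. It also shows why code-normalized xVal can suppress near-median values: the shared [NUM] embedding is scaled by the normalized value, so values near the code median become near-zero inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_BINS = 16, 10
bin_emb = rng.normal(size=(N_BINS, D))  # stand-in for learned bin embeddings
num_emb = rng.normal(size=D)            # stand-in for the shared [NUM] embedding

centers = np.linspace(3.0, 5.5, N_BINS)   # invented potassium bin centers
edges = (centers[:-1] + centers[1:]) / 2

def hard_bin(v: float) -> np.ndarray:
    """Hard discretization: look up a single bin embedding."""
    return bin_emb[int(np.searchsorted(edges, v))]

def soft_bin(v: float, temp: float = 0.5) -> np.ndarray:
    """Soft discretization: distance-weighted mixture over neighboring bins."""
    w = np.exp(-((v - centers) ** 2) / temp)
    return (w / w.sum()) @ bin_emb

def xval(v: float, median: float = 4.1, scale: float = 0.5) -> np.ndarray:
    """Code-normalized xVal-style encoding: one embedding scaled by the value."""
    return ((v - median) / scale) * num_emb

for v in (4.15, 6.0):
    print(v, np.linalg.norm(hard_bin(v)), np.linalg.norm(soft_bin(v)),
          np.linalg.norm(xval(v)))
# xval's norm collapses for v near the median (4.15) but not for outliers (6.0),
# while the discrete encoders keep a full-magnitude embedding everywhere.
```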

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be reused on other longitudinal datasets to test whether fused tokenization remains advantageous outside single-site ICU data.
  • Representation design may offer a lower-compute path to performance gains than increasing model size or training epochs in medical settings.
  • Standardized formats like CLIF could enable pooling across hospitals without large performance loss, supporting broader model training.

Load-bearing premise

That one-epoch training of 28 matched transformers under a shared budget sufficiently separates representation effects from optimization dynamics or data leakage in the MIMIC-IV splits.

What would settle it

Retraining the same representation variants for multiple epochs or across different random seeds and data splits, then checking whether the AUROC gaps between fused and unfused tokenization shrink or vanish.
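A runnable sketch of that check, using scikit-learn's roc_auc_score on synthetic scores; the score generator is a stand-in for retrained models, with shift values picked to land near the paper's reported AUROCs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)  # held-out mortality labels (synthetic)

def auroc_for_seed(shift: float, seed: int) -> float:
    r = np.random.default_rng(seed)
    scores = y * shift + r.normal(size=y.size)  # stand-in for a retrained model
    return roc_auc_score(y, scores)

# Fused vs. unfused arm "retrained" across 10 seeds, then bootstrap the gap.
gaps = np.array([auroc_for_seed(1.9, s) - auroc_for_seed(1.7, s + 100)
                 for s in range(10)])
boot = [np.mean(rng.choice(gaps, size=gaps.size)) for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"mean AUROC gap {gaps.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
# If the interval excludes zero across seeds and splits, the fused advantage
# is unlikely to be an artifact of a single training run.
```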

Figures

Figures reproduced from arXiv: 2604.16775 by Bashar Ramadan, Brett K. Beaulieu-Jones, Inhyeok Lee, Luke Solo, Michael C. Burkhart, William F. Parker.

Figure 1
Figure 1: Representation benchmark axes and evaluations. Left: the three representation axes varied across Experiments 1–3. Right: the 30 clinical outcomes (17 binary, 13 regression). Binary outcomes are scored with AUROC, AUPRC, Brier score, and ECE-15; regression outcomes with Spearman ρ. view at source ↗
Figure 2
Figure 2: Experiment 1 full outcome sweep. Left: AUROC for all 16 binary outcomes across the 12 granularity/tokenization settings. Right: Spearman ρ for all 13 regression outcomes with the same settings. Circles denote unfused tokenization and squares denote fused tokenization; points are test-set estimates and whiskers are 95% bootstrap confidence intervals. view at source ↗
Figure 3
Figure 3: Experiment 2 full outcome sweep. Left: AUROC for all 16 binary outcomes across the 12 Experiment 2 configurations. Right: Spearman ρ for all 13 regression outcomes. Marker shape distinguishes event order only, time tokens, and admission-relative RoPE. Color encodes the value encoder. Points and whiskers are test-set estimates with 95% bootstrap confidence intervals. view at source ↗
Figure 4
Figure 4: Experiment 1 axis comparisons. Mean test-set AUROC (top row) and Spearman ρ (bottom row) across the full 29-outcome sweep, aggregated one representation axis at a time. Left-to-right: fusion, granularity, and reference-range anchored laboratory binning. Bars show the mean point estimate with mean lower/upper 95% bootstrap confidence bounds. view at source ↗
Figure 5
Figure 5. view at source ↗
Figure 6
Figure 6: Experiment 3 outcome comparisons. Left: AUROC for all 16 binary outcomes across the four vocabulary arms. Right: Spearman ρ for all 13 regression outcomes. Points and whiskers are test-set estimates with 95% bootstrap confidence intervals. view at source ↗
Figure 7
Figure 7: Experiment 3 arm comparisons. Mean AUROC (top) and Spearman ρ (bottom) across the full 29-outcome sweep for the four vocabulary arms. Bars show mean point estimates with mean lower/upper 95% bootstrap confidence bounds. view at source ↗
Figure 8
Figure 8: Embedding geometry: shared vs. code-specific centile encoders. Left: PCA of the shared unfused centile manifold. Right panels: fused (code-specific) centile embeddings for six measurements. Under fusion, each variable's tokens form a distinct arc; panel labels show the realized bin count (e.g., potassium collapses to 28 of the nominal 100 centiles). view at source ↗
Figure 9
Figure 9: Post-hoc analyses. (a) Clinical-boundary probe accuracy from leave-one-out logistic regression on bin-token embeddings. BG denotes the blood-gas assay of glucose and potassium. (b) Hidden-state L2 norms at [NUM] positions for the xVal-TimeTokens and xVal-Affine-TimeTokens models, shown for layers 0 and 7. Red points mark positions with |z| < 0.5, blue points mark the remaining numeric positions, and the bl… view at source ↗
Figure 10
Figure 10: Token-length distributions (untruncated token lists). Length histograms computed from the Parquet tokens column (per admission), shown for unfused vs. fused tokenization and for both full timelines and first-24h cuts. Dashed vertical lines mark 1024/2048/4096 tokens. For readability under heavy tails, the x-axis is capped at 6000 tokens and all mass beyond 6000 is aggregated into an overflow bin; a CDF ov… view at source ↗
Figure 11
Figure 11: Token-length distributions: inserting time tokens vs. no-time-token tokenization. Same layout as Figure 10. view at source ↗
read the original abstract

Every prediction from a generative medical event model is bounded by how clinical events are tokenized, yet input representation is rarely isolated from other system and architectural choices. We evaluate how representation decisions affect downstream prediction after a shared one-epoch pretraining budget. We train 28 matched transformers on MIMIC-IV and evaluate them on 30 clinical outcomes in three experiments: (1) quantization granularity, reference-range anchoring, and code-value fusion; (2) value encoding (hard bins, soft discretization, code-normalized xVal) crossed with temporal encoding (event order, time tokens, admission-relative RoPE); and (3) native MIMIC laboratory/vital codes versus the Common Longitudinal ICU Format (CLIF)-remapped laboratory/vital codes with compression-preserving perturbation arms. In Experiment 1, fused code-value tokenization improves mortality AUROC from 0.891 to 0.915 (BH-adjusted p < 0.001), hospital length-of-stay AUROC from 0.763 to 0.788 (BH-adjusted p < 0.001), and, for the decile fused-vs-unfused comparison, mean regression Spearman rho across the 13 regression outcomes from 0.414 to 0.494. Across the three temporal encodings, event order only and admission-relative RoPE match or exceed inserting time tokens on average while shortening sequences by 11%. CLIF remapping preserves downstream performance in our single-site setting while yielding a smaller, clinically interpretable token set compatible with multi-site use. Finer-than-decile quantization, reference-range anchoring, and soft discretization help in selective outcomes, while code-normalized xVal remains well below the discrete and soft families, consistent with near-median suppression that persists after the affine variant.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that input representation choices in generative medical event models can be isolated and benchmarked under a fixed one-epoch pretraining budget. By training 28 matched transformers on MIMIC-IV and evaluating on 30 clinical outcomes, it reports that fused code-value tokenization yields AUROC gains for mortality (0.891 to 0.915) and hospital length-of-stay (0.763 to 0.788) with BH-adjusted p < 0.001, plus improved mean Spearman rho (0.414 to 0.494) across 13 regression tasks; additional experiments examine quantization granularity, value encodings, temporal encodings (showing event order and admission-relative RoPE competitive with time tokens while shortening sequences), and CLIF remapping of codes.

Significance. If the central attribution to representation holds, the work supplies a practical, budget-controlled benchmark for tokenization decisions in clinical sequence models. Strengths include the matched-model design across multiple outcomes, use of BH-adjusted tests, and concrete recommendations (e.g., fusion and CLIF compatibility) that could guide practitioners without increasing compute. The empirical focus on held-out performance rather than theoretical claims makes the results directly actionable if confounding factors are ruled out.

major comments (2)
  1. [Experiment 1 / Methods] The one-epoch shared-budget protocol does not isolate representation effects from optimization dynamics. Different tokenizations change vocabulary size, sequence length (fused tokens shorten sequences), and input statistics, altering per-step gradient magnitudes and convergence speed. Without loss curves, multi-epoch ablations, or seed-wise variance reported for the 28 models, the AUROC gains (mortality 0.891→0.915, LOS 0.763→0.788) and rho improvement cannot be confidently attributed to representation quality rather than faster convergence under the fixed budget. This is load-bearing for the headline claims in Experiment 1.
  2. [Data and Splits] MIMIC-IV patient splits are not described as strictly temporal. This raises the possibility that representation-specific leakage (e.g., via code-value fusion altering which events appear in training vs. test) interacts with the single-epoch regime, potentially inflating the reported gains. A temporal split or explicit leakage audit would be needed to support the cross-representation comparisons.
minor comments (2)
  1. [Abstract / Results] The abstract and results should explicitly state how the mean Spearman rho is aggregated across the 13 regression outcomes and whether the decile fused-vs-unfused comparison uses the same patient cohort as the AUROC tasks.
  2. [Statistical Analysis] Provide the exact number of comparisons underlying the BH adjustment and confirm that all 30 outcomes were included in the correction; this would strengthen interpretation of the p < 0.001 statements.
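For reference, the Benjamini–Hochberg step-up procedure behind the "BH-adjusted p" values fits in a few lines; the p-values below are illustrative, and the adjusted values depend directly on the number of comparisons m, which is exactly what minor comment 2 asks the authors to state.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values) for m comparisons."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)         # p_(i) * m / i
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    q = np.empty(m)
    q[order] = np.clip(ranked, 0.0, 1.0)
    return q

print(bh_adjust([0.0004, 0.0009, 0.012, 0.03, 0.21]))
# [0.002   0.00225 0.02    0.0375  0.21  ]
```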

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, agreeing that further details on optimization and splits will improve clarity and attribution. Revisions will be made accordingly.

read point-by-point responses
  1. Referee: [Experiment 1 / Methods] The one-epoch shared-budget protocol does not isolate representation effects from optimization dynamics. Different tokenizations change vocabulary size, sequence length (fused tokens shorten sequences), and input statistics, altering per-step gradient magnitudes and convergence speed. Without loss curves, multi-epoch ablations, or seed-wise variance reported for the 28 models, the AUROC gains (mortality 0.891→0.915, LOS 0.763→0.788) and rho improvement cannot be confidently attributed to representation quality rather than faster convergence under the fixed budget. This is load-bearing for the headline claims in Experiment 1.

    Authors: We agree that sequence length and vocabulary differences under a one-epoch budget can affect per-step gradients and convergence rates, potentially confounding pure representation effects. The fixed-budget design was selected to mirror practical clinical modeling constraints where multi-epoch training is often limited by compute and data availability. To strengthen the claims, we will add training loss curves for the fused versus unfused comparisons and report standard deviations across multiple seeds for the primary AUROC and Spearman rho results. Full multi-epoch ablations across all 28 models are not feasible given our compute resources, but we will discuss this limitation explicitly. revision: partial

  2. Referee: [Data and Splits] MIMIC-IV patient splits are not described as strictly temporal. This raises the possibility that representation-specific leakage (e.g., via code-value fusion altering which events appear in training vs. test) interacts with the single-epoch regime, potentially inflating the reported gains. A temporal split or explicit leakage audit would be needed to support the cross-representation comparisons.

    Authors: We will revise the methods section to explicitly state that splits are random patient-level partitions with no patient overlap across train, validation, and test sets. Tokenization, including code-value fusion, is applied after splitting and therefore does not differentially alter event membership between sets. We will include a leakage audit confirming absence of patient ID or event-type overlap. While a temporal split could address potential distribution shifts, the single-center MIMIC-IV setting with random splits follows common practice; we will add this as a noted limitation and consider supplementary temporal-split results if space allows. revision: yes
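A minimal sketch of the promised leakage audit, assuming pandas frames keyed by a subject_id column (the column name follows MIMIC-IV convention and is an assumption, not taken from the paper):

```python
import pandas as pd

def audit_patient_overlap(**splits: pd.DataFrame) -> None:
    """Fail loudly if any patient appears in more than one split."""
    ids = {name: set(df["subject_id"]) for name, df in splits.items()}
    names = list(ids)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = ids[a] & ids[b]
            assert not shared, f"{len(shared)} patients shared between {a} and {b}"
    print("no patient-level overlap across splits")

# toy usage
audit_patient_overlap(
    train=pd.DataFrame({"subject_id": [1, 2, 3]}),
    val=pd.DataFrame({"subject_id": [4]}),
    test=pd.DataFrame({"subject_id": [5, 6]}),
)
```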

Circularity Check

0 steps flagged

No significant circularity: purely empirical benchmark of tokenization variants under fixed training budget

full rationale

The paper reports results from training 28 matched transformers on MIMIC-IV under a shared one-epoch budget and measuring held-out AUROC/Spearman performance across 30 clinical outcomes for different tokenization schemes (fused code-value, quantization, temporal encodings, CLIF remapping). No equations, derivations, or first-principles claims appear; performance differences are presented as direct experimental measurements rather than predictions derived from fitted parameters or self-referential definitions. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are invoked to justify core results. The evaluation is self-contained against external benchmarks (held-out MIMIC-IV splits) with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claims rest on the representativeness of MIMIC-IV for medical events and the validity of AUROC/Spearman metrics as proxies for clinical utility; no new entities are postulated.

axioms (1)
  • domain assumption MIMIC-IV single-site data is representative for testing representation effects that generalize to other settings
    All experiments and conclusions are drawn exclusively from this dataset without external validation cohorts.

pith-pipeline@v0.9.0 · 5647 in / 1372 out tokens · 70287 ms · 2026-05-10T07:19:17.001733+00:00 · methodology

discussion (0)

