arxiv: 1907.09538 · v1 · pith:OKLLM7I5new · submitted 2019-07-22 · 💻 cs.LG · stat.ML

BEHRT: Transformer for Electronic Health Records

Pith reviewed 2026-05-24 18:02 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords BEHRT · transformer · electronic health records · disease prediction · multitask prediction · attention mechanism · disease trajectories

The pith

BEHRT, a transformer model for electronic health records, improves average precision for predicting the onset of 301 conditions by an absolute 8.0-10.8% over prior deep EHR models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BEHRT as a deep neural sequence transduction model based on the transformer architecture for electronic health records. The goal is to enable early detection and prediction of diseases through multitask learning on patient histories. Evaluated on data from nearly 1.6 million individuals, BEHRT demonstrates absolute gains of 8.0-10.8% in average precision score over state-of-the-art deep EHR models for predicting the onset of 301 conditions. The model also uses its attention mechanism to provide personalized disease trajectory mapping and can incorporate multiple types of medical data.

Core claim

BEHRT is a transformer-based model for EHR that supports multitask prediction and disease trajectory mapping. Trained on nearly 1.6 million individuals' data, it achieves an absolute improvement of 8.0-10.8% in Average Precision Score compared to existing state-of-the-art deep EHR models for predicting onset of 301 conditions. Its attention mechanism offers a personalised view of disease trajectories, its architecture handles heterogeneous concepts such as diagnosis and medication, and its pre-training yields disease and patient representations that support interpretable predictions.
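
For readers unfamiliar with the metric, the block below states the standard definition of average precision and separates an absolute gain from a relative one. This page does not reproduce the paper's exact evaluation protocol, so the formula is the common textbook form and is an assumption about what the paper computes; $P_n$ and $R_n$ denote precision and recall at the $n$-th score threshold, with $R_0 = 0$.

```latex
\mathrm{AP} = \sum_{n} \left( R_n - R_{n-1} \right) P_n ,
\qquad
\Delta_{\mathrm{abs}} = \mathrm{AP}_{\mathrm{BEHRT}} - \mathrm{AP}_{\mathrm{baseline}} ,
\qquad
\Delta_{\mathrm{rel}} = \frac{\Delta_{\mathrm{abs}}}{\mathrm{AP}_{\mathrm{baseline}}}
```

As a purely hypothetical illustration, a baseline AP of 0.20 raised to 0.28 is an absolute gain of 0.08 (8 percentage points) but a relative gain of 40%. The 8.0-10.8% figure quoted above is stated as absolute.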

What carries the argument

BEHRT, a transformer architecture adapted as a sequence transduction model for sequences of electronic health record events.
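
To make the attention idea concrete, here is a minimal sketch of bidirectional scaled dot-product self-attention over a toy event sequence. It is not BEHRT's implementation: the event names and two-dimensional embeddings are hypothetical, there is a single head, and queries and keys are taken to be the raw embeddings with no learned projection matrices.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_weights(embeddings):
    """Attention matrix for one head with no causal mask.

    Row i holds the weights that event i places on every event in the
    sequence, earlier or later, which is what "bidirectional" means here.
    """
    d = len(embeddings[0])
    weights = []
    for q in embeddings:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in embeddings]
        weights.append(softmax(scores))
    return weights


# Hypothetical toy history and embeddings, for illustration only.
events = ["hypertension", "type 2 diabetes", "heart failure"]
toy_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.9, 0.4]]

attn = self_attention_weights(toy_embeddings)
for name, row in zip(events, attn):
    print(name, [round(w, 3) for w in row])
```

Each row sums to one, and the first event assigns non-zero weight to later events because nothing masks the future. This matches the Figure 5 caption's note that a bidirectional model captures non-directional relationships, so the weights should not be read as a temporal or causal ordering.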

If this is right

  • Improved accuracy for predicting the onset of 301 medical conditions.
  • Personalized mapping of individual disease trajectories using attention.
  • Incorporation of multiple heterogeneous data concepts to boost prediction accuracy.
  • Generation of disease and patient representations through pre-training for better interpretability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such models could support earlier interventions in healthcare by identifying at-risk patients before symptoms develop.
  • Analysis of the attention patterns might uncover previously unknown relationships in disease progression.
  • The representations learned could be applied to other predictive tasks in medicine.

Load-bearing premise

The performance improvements are attributable to the BEHRT architecture and pre-training rather than to differences in data cleaning, feature construction, or baseline model implementations.

What would settle it

Re-implementing the baseline models using the identical data processing pipeline and patient cohort as BEHRT and comparing the resulting average precision scores.
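
The comparison described above can be sketched as follows, assuming both models are scored on the same held-out patients for each condition. The condition names, labels, and scores are hypothetical toy data, `average_precision` and `mean_ap` are illustrative helpers rather than the paper's code, and the AP routine assumes no tied scores.

```python
def average_precision(y_true, y_score):
    """Mean of precision-at-rank over the positive examples.

    Equivalent to sum_n (R_n - R_{n-1}) * P_n when scores have no ties.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def mean_ap(labels_by_condition, scores_by_condition):
    """Macro-average AP across conditions; also return the per-condition values."""
    per_condition = {
        c: average_precision(labels_by_condition[c], scores_by_condition[c])
        for c in labels_by_condition
    }
    return sum(per_condition.values()) / len(per_condition), per_condition


# Hypothetical toy data: two conditions, the same four held-out patients.
labels = {"cond_a": [1, 0, 1, 0], "cond_b": [0, 1, 0, 0]}
model_x = {"cond_a": [0.9, 0.2, 0.8, 0.1], "cond_b": [0.1, 0.7, 0.3, 0.2]}
model_y = {"cond_a": [0.9, 0.8, 0.3, 0.1], "cond_b": [0.6, 0.5, 0.1, 0.2]}

mean_x, _ = mean_ap(labels, model_x)
mean_y, _ = mean_ap(labels, model_y)
print(f"absolute difference in mean AP: {mean_x - mean_y:.3f}")
```

Holding the labels and patient set fixed while swapping only the model scores is the point of the check: any remaining gap then reflects the models rather than the cohort or preprocessing.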

Figures

Figures reproduced from arXiv: 1907.09538 by Abdelaali Hassaine, Dexter Canoy, Gholamreza Salimi-Khorshidi, Jose Roberto Ayala Solares, Kazem Rahimi, Shishir Rao, Yajie Zhu, Yikuan Li.

Figure 1: Linkage and filtering of CPRD data. This flow lists all the key steps of our data cleaning and linkage procedure.
Figure 2: Preparation of CPRD data for BEHRT. An example patient’s EHR sequence can be seen in (a), which consists...
Figure 3: BEHRT architecture. Using the artificial data shown in Figure 2, (a) shows how BEHRT sees one’s EHR. In...
Figure 4: Based on the resulting patterns in lower dimension, we can see that diseases that are known to co-occur and/or...
Figure 5: Note that, since BEHRT is bidirectional, the self-attention mechanism captures non-temporal/non-directional...
Figure 5: Disease Self-Attention Analysis. This figure shows the EHR history (shown chronologically, going down...
Figure 6: Disease-wise precision analysis. Each circle in these graphs represents a disease, who and color and size are...
Figure 7: Disease-wise precision comparison for BEHRT, Deepr and RETAIN, all models trained on same dataset and...
Original abstract

Today, despite decades of developments in medicine and the growing interest in precision healthcare, the vast majority of diagnoses happen once patients begin to show noticeable signs of illness. Early indication and detection of diseases, however, can provide patients and carers with the chance of early intervention, better disease management, and efficient allocation of healthcare resources. The latest developments in machine learning (more specifically, deep learning) provide a great opportunity to address this unmet need. In this study, we introduce BEHRT: A deep neural sequence transduction model for EHR (electronic health records), capable of multitask prediction and disease trajectory mapping. When trained and evaluated on the data from nearly 1.6 million individuals, BEHRT shows a striking absolute improvement of 8.0-10.8%, in terms of Average Precision Score, compared to the existing state-of-the-art deep EHR models (in terms of average precision, when predicting for the onset of 301 conditions). In addition to its superior prediction power, BEHRT provides a personalised view of disease trajectories through its attention mechanism; its flexible architecture enables it to incorporate multiple heterogeneous concepts (e.g., diagnosis, medication, measurements, and more) to improve the accuracy of its predictions; and its (pre-)training results in disease and patient representations that can help us get a step closer to interpretable predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BEHRT, a transformer-based sequence model for electronic health records (EHR) that performs multitask prediction of disease onset for 301 conditions. Trained on records from nearly 1.6 million patients, it reports an absolute improvement of 8.0-10.8% in Average Precision Score over prior deep EHR models (Deepr and RETAIN, per the Figure 7 caption), while also enabling interpretable personalized disease trajectories via attention and supporting heterogeneous input types through its flexible architecture.

Significance. If the performance gains can be isolated to the architecture and pre-training, the result would be significant for scaling transformer models to large-scale longitudinal EHR data and for multitask clinical prediction. The scale of the cohort and the attention-based trajectory mapping are positive features; however, the absence of controlled baseline re-implementations reduces the strength of the central empirical claim.

major comments (2)
  1. [§4] §4 (Experiments) and Appendix A: The manuscript describes BEHRT's cohort construction, input representation, and visit aggregation but provides no side-by-side specification of the diagnosis/medication vocabularies, censoring windows, or train/validation/test partitioning applied when re-implementing the baselines (Deepr and RETAIN). Without this, the 8.0-10.8% APS improvement cannot be attributed to the transformer architecture rather than differences in data handling.
  2. [§4] §4: No statistical testing, confidence intervals, or multiple-run variance is reported for the APS differences across the 301 conditions. This is required to establish that the reported gains are robust rather than artifacts of a single split or random seed.
minor comments (2)
  1. [Abstract] The abstract states the APS improvement but does not define the exact evaluation protocol (e.g., time-to-event window, positive/negative class construction); this detail should appear in the main text or a dedicated evaluation subsection.
  2. [§3] Notation for the multi-concept embedding (diagnosis, medication, measurements) in §3 is introduced descriptively; an explicit equation or diagram would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the empirical claims.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments) and Appendix A: The manuscript describes BEHRT's cohort construction, input representation, and visit aggregation but provides no side-by-side specification of the diagnosis/medication vocabularies, censoring windows, or train/validation/test partitioning applied when re-implementing the baselines (Deepr and RETAIN). Without this, the 8.0-10.8% APS improvement cannot be attributed to the transformer architecture rather than differences in data handling.

    Authors: We agree that the absence of explicit side-by-side specifications weakens the ability to isolate architectural contributions. The baselines were re-implemented on the identical 1.6M-patient cohort with the same visit aggregation and censoring logic as BEHRT, but the manuscript does not document the exact vocabulary mappings or split indices used for each baseline. In the revision we will add a comparative table in Appendix A listing vocabulary sizes, censoring windows, and train/validation/test partitioning for BEHRT and all re-implemented baselines. revision: yes

  2. Referee: [§4] §4: No statistical testing, confidence intervals, or multiple-run variance is reported for the APS differences across the 301 conditions. This is required to establish that the reported gains are robust rather than artifacts of a single split or random seed.

    Authors: The single split was chosen to preserve maximum training data for the 301-task multitask setting on a large cohort. We acknowledge that variance and significance testing are needed. In the revision we will report APS means and standard deviations over five independent runs with different random seeds, include 95% confidence intervals, and add paired statistical tests (e.g., Wilcoxon signed-rank) between BEHRT and each baseline across the 301 conditions. revision: yes
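
The paired test proposed in this response can be sketched in plain Python. This is a minimal version of the two-sided Wilcoxon signed-rank test using the normal approximation, with no tie or continuity correction to the variance; in practice a library routine such as `scipy.stats.wilcoxon` would normally be used. The per-condition APS values are hypothetical, and ten pairs is too few for the approximation to be reliable, so the output is illustrative only.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Returns (w_plus, z, p). Zero differences are dropped, and tied absolute
    differences share the average of the ranks they span.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, z, p


# Hypothetical per-condition APS for two models over the same ten conditions.
aps_model_a = [0.30, 0.42, 0.25, 0.51, 0.38, 0.47, 0.29, 0.36, 0.44, 0.33]
aps_model_b = [0.22, 0.35, 0.27, 0.40, 0.30, 0.41, 0.21, 0.30, 0.39, 0.31]

w_plus, z, p = wilcoxon_signed_rank(aps_model_a, aps_model_b)
print(f"W+ = {w_plus}, z = {z:.2f}, p = {p:.4f}")
```

Pairing by condition matters because per-condition APS varies widely with prevalence; the signed-rank test asks whether one model is consistently ahead across conditions rather than comparing two pooled averages.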

Circularity Check

0 steps flagged

No circularity: empirical performance claims with no derivation chain

Full rationale

The paper introduces BEHRT as a transformer-based model and reports empirical APS improvements on a large EHR cohort. No equations, parameter fits, or derivation steps are present that could reduce to self-defined inputs. The performance comparison is an external benchmark result rather than a constructed prediction; no load-bearing self-citation, ansatz smuggling, or uniqueness theorem appears in the abstract or described content. The central claim remains an independent experimental outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated domain assumption that the 1.6 million patient records constitute a representative and unbiased sample for the 301 conditions.

axioms (1)
  • Domain assumption: The EHR dataset of nearly 1.6 million individuals is representative and free of major selection or recording biases for the 301 conditions studied.
    The performance claim is conditioned on this dataset being suitable for generalization.

pith-pipeline@v0.9.0 · 5805 in / 1353 out tokens · 45943 ms · 2026-05-24T18:02:24.070915+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 1 internal anchor

  1. [1]

    End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography

    Diego Ardila, Atilla P Kiraly, Sujeeth Bharadwaj, Bokyung Choi, Joshua J Reicher, Lily Peng, Daniel Tse, Mozziyar Etemadi, Wenxing Ye, Greg Corrado, and David P Naidich. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature Medicine, 25(June), 2019

  2. [2]

    Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning

    Ryan Poplin, Avinash V Varadarajan, Katy Blumer, Yun Liu, Michael V McConnell, Greg S Corrado, Lily Peng, and Dale R Webster. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering, 2(3):158–164, 2018

  3. [3]

    High-performance medicine: the convergence of human and artificial intelligence

    Eric J Topol. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(January), 2019

  4. [4]

    A guide to deep learning in healthcare

    Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare. Nature Medicine, 25(1):24–29, 2019

  5. [5]

    UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age

    Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, Bette Liu, Paul Matthews, Giok Ong, Jill Pell, Alan Silman, Alan Young, Tim Sprosen, Tim Peakman, and Rory Collins. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of...

  6. [6]

    Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis

    Benjamin Shickel, Patrick Tighe, Azra Bihorac, and Parisa Rashidi. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE Journal of Biomedical and Health Informatics, 22(5):1589–1604, 2018

  7. [7]

    Electronic Public Health Reporting

    ONC Annual Meeting. Electronic Public Health Reporting. 2018. Available at: https://www.healthit.gov/sites/default/files/2018-12/ElectronicPublicHealthReporting.pdf

  8. [8]

    Hospitals’ Use of Electronic Health Records Data, 2015-2017

    Sonal Parasrampuria and Jawanna Henry. Hospitals’ Use of Electronic Health Records Data, 2015-2017. ONC Data Brief, No. 46, 2019

  9. [9]

    Predicting the risk of emergency admission with machine learning: Development and validation using linked electronic health records

    Fatemeh Rahimian, Gholamreza Salimi-Khorshidi, Amir H Payberah, Jenny Tran, Roberto Ayala Solares, Francesca Raimondi, Milad Nazarzadeh, Dexter Canoy, and Kazem Rahimi. Predicting the risk of emergency admission with machine learning: Development and validation using linked electronic health records. PLoS Medicine, 15(11):1–18, 2018

  10. [10]

    Deep learning for healthcare decision making with EMRs

    Zhaohui Liang, Gang Zhang, Jimmy Xiangji Huang, and Qinmin Vivian Hu. Deep learning for healthcare decision making with EMRs. Proceedings - 2014 IEEE International Conference on Bioinformatics and Biomedicine, IEEE BIBM 2014, pages 556–559, 2014

  11. [11]

    Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (eNRBM)

    Truyen Tran, Tu Dinh Nguyen, Dinh Phung, and Svetha Venkatesh. Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (eNRBM). Journal of Biomedical Informatics, 2015

  12. [12]

    Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records

    Riccardo Miotto, Li Li, Brian A Kidd, and Joel T Dudley. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Scientific Reports, 6(May):1–10, 2016

  13. [13]

    Deepr: A Convolutional Net for Medical Records

    Phuoc Nguyen, Truyen Tran, Nilmini Wickramasinghe, and Svetha Venkatesh. Deepr: A Convolutional Net for Medical Records. IEEE Journal of Biomedical and Health Informatics, 21(1):22–30, may 2017

  14. [14]

    Doctor AI: Predicting Clinical Events via Recurrent Neural Networks

    Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F Stewart, and Jimeng Sun. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. JMLR workshop and conference proceedings, 56:301–318, 2016

  15. [15]

    DeepCare: A deep dynamic memory model for predictive medicine

    Trang Pham, Truyen Tran, Dinh Phung, and Svetha Venkatesh. DeepCare: A deep dynamic memory model for predictive medicine. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9652 LNAI(i):30–41, 2016

  16. [16]

    RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism

    Edward Choi, Mohammad Taha Bahadori, Joshua A. Kulas, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. arxiv, 2016

  17. [17]

    Deep Learning for Electronic Health Records: A Comparative Review of Multiple Deep Neural Architectures

    Jose Roberto Ayala Solares, Francesca Elisa Diletta Raimondi, Yajie Zhu, Fatemeh Rahimian, Dexter Canoy, Jenny Tran, Ana Catarina Pinho Gomes, Amir Payberah, Mariagrazia Zottoli, Milad Nazarzadeh, Nathalie Conrad, Kazem Rahimi, and Gholamreza Salimi-Khorshidi. Deep Learning for Electronic Health Records: A Comparative Review of Multiple Deep Neural Archit...

  18. [18]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arxiv, 2018

  19. [19]

    Data Resource Profile: Clinical Practice Research Datalink (CPRD)

    Emily Herrett, Arlene M Gallagher, Krishnan Bhaskaran, Harriet Forbes, Rohini Mathur, Tjeerd Van Staa, and Liam Smeeth. Data Resource Profile: Clinical Practice Research Datalink (CPRD). International Journal of Epidemiology, 44(3):827–836, 2015. 11 A PREPRINT - JULY 24, 2019

  20. [20]

    The UK general practice research database

    T Walley and A Mantgani. The UK general practice research database. The Lancet, 350(9084):1097–1099, 1997

  21. [21]

    Usual blood pressure, peripheral arterial disease, and vascular risk: Cohort study of 4.2 million adults

    Connor A Emdin, Simon G Anderson, Thomas Callender, Nathalie Conrad, Gholamreza Salimi-Khorshidi, Hamid Mohseni, Mark Woodward, and Kazem Rahimi. Usual blood pressure, peripheral arterial disease, and vascular risk: Cohort study of 4.2 million adults. BMJ (Online), 2015

  22. [22]

    Usual blood pressure, atrial fibrillation and vascular risk: Evidence from 4.3 million adults

    Connor A. Emdin, Simon G. Anderson, Gholamreza Salimi-Khorshidi, Mark Woodward, Stephen MacMahon, Terrence Dwyer, and Kazem Rahimi. Usual blood pressure, atrial fibrillation and vascular risk: Evidence from 4.3 million adults. International Journal of Epidemiology, 2017

  23. [23]

    The ’top 10’ urological procedures: A study of hospital episodes statistics 1998-99

    F. Lee, H. R.S. Patel, and M. Emberton. The ’top 10’ urological procedures: A study of hospital episodes statistics 1998-99. BJU International, 2002

  24. [24]

    Influenza vaccination and risk of hospitalization in patients with heart failure: A self-controlled case series study

    Hamid Mohseni, Amit Kiran, Reza Khorshidi, and Kazem Rahimi. Influenza vaccination and risk of hospitalization in patients with heart failure: A self-controlled case series study. European Heart Journal, 2017

  25. [25]

    Read Codes

    NHS. Read Codes. Available at: https://digital.nhs.uk/services/ terminology-and-classifications/read-codes

  26. [26]

    ICD-10 online versions

    WHO. ICD-10 online versions. Available at https://icd.who.int/browse10/2016/e

  27. [27]

    A chronological map of 308 physical and mental health conditions from 4 million individuals in the English National Health Service

    Valerie Kuan, Spiros Denaxas, Arturo Gonzalez-izquierdo, Kenan Direk, Osman Bhatti, Shanaz Husain, Shailen Sutaria, Melanie Hingorani, Dorothea Nitsch, Constantinos A Parisinos, R Thomas Lumbers, Rohini Mathur, Reecha Sofat, Juan P Casas, Ian C K Wong, and Harry Hemingway. A chronological map of 308 physical and mental health conditions from 4 mi...

  28. [28]

    Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

    Kyunghyun Cho. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. arxiv, 2013

  29. [29]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762 [cs], apr 2017

  30. [30]

    On the difficulty of training Recurrent Neural Networks

    Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training Recurrent Neural Networks. arxiv, 2012

  31. [31]

    Multimorbidity: a priority for global health research

    The Academy of Medical Sciences. Multimorbidity: a priority for global health research. The Academy of Medical Sciences, pages 1–127, 2018

  32. [32]

    Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation

    David M W Powers. Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation. arxiv, 2007

  33. [33]

    An introduction to ROC analysis

    Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 2006

  34. [34]

    Recall, precision and average precision

    Mu Zhu. Recall, precision and average precision. Department of Statistics and Actuarial Science, . . ., 2004

  35. [35]

    Practical Bayesian Optimization of Machine Learning Algorithms

    Jasper Snoek, Hugo Larochelle, and Ryan Adams. Practical Bayesian Optimization of Machine Learning Algorithms. NIPS, 2(12):e540, 2017

  36. [36]

    Evaluating Word Embedding Models: Methods and Experimental Results

    Bin Wang, Angela Wang, Fenxiao Chen, Yuncheng Wang, and C Jay Kuo. Evaluating Word Embedding Models: Methods and Experimental Results. arxiv, pages 1–13, 2019

  37. [37]

    Visualizing Data using t-SNE

    Laurens Van Der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. JMLR, 9:2579–2605, 2008

  38. [38]

    Visualizing Attention in Transformer-Based Language Representation Models

    Jesse Vig. Visualizing Attention in Transformer-Based Language Representation Models. arxiv, pages 2–7, 2019.

    A Hyperparameter Tuning. We show the hyperparameter tuning results here in the following section. In Table 2, we show the results of the MLM training hyperparameter tuning process. We performed Bayesian Optimization to...