BEHRT: Transformer for Electronic Health Records
Pith reviewed 2026-05-24 18:02 UTC · model grok-4.3
The pith
BEHRT transformer model improves prediction of 301 disease onsets from electronic health records by 8.0-10.8 percent over prior deep models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BEHRT is a transformer-based model for EHR that supports multitask prediction and disease trajectory mapping. Trained on nearly 1.6 million individuals' data, it achieves an absolute improvement of 8.0-10.8% in Average Precision Score compared to existing state-of-the-art deep EHR models for predicting onset of 301 conditions. Its attention mechanism offers a personalised view of disease trajectories, its architecture handles heterogeneous concepts such as diagnosis and medication, and its pre-training yields disease and patient representations that support interpretable predictions.
What carries the argument
BEHRT, a transformer architecture adapted as a sequence transduction model for sequences of electronic health record events.
If this is right
- Improved accuracy for predicting the onset of 301 medical conditions.
- Personalized mapping of individual disease trajectories using attention.
- Incorporation of multiple heterogeneous data concepts to boost prediction accuracy.
- Generation of disease and patient representations through pre-training for better interpretability.
Where Pith is reading between the lines
- Such models could support earlier interventions in healthcare by identifying at-risk patients before symptoms develop.
- Analysis of the attention patterns might uncover previously unknown relationships in disease progression.
- The representations learned could be applied to other predictive tasks in medicine.
Load-bearing premise
The performance improvements are attributable to the BEHRT architecture and pre-training rather than to differences in data cleaning, feature construction, or baseline model implementations.
What would settle it
Re-implementing the baseline models using the identical data processing pipeline and patient cohort as BEHRT and comparing the resulting average precision scores.
Figures
read the original abstract
Today, despite decades of developments in medicine and the growing interest in precision healthcare, vast majority of diagnoses happen once patients begin to show noticeable signs of illness. Early indication and detection of diseases, however, can provide patients and carers with the chance of early intervention, better disease management, and efficient allocation of healthcare resources. The latest developments in machine learning (more specifically, deep learning) provides a great opportunity to address this unmet need. In this study, we introduce BEHRT: A deep neural sequence transduction model for EHR (electronic health records), capable of multitask prediction and disease trajectory mapping. When trained and evaluated on the data from nearly 1.6 million individuals, BEHRT shows a striking absolute improvement of 8.0-10.8%, in terms of Average Precision Score, compared to the existing state-of-the-art deep EHR models (in terms of average precision, when predicting for the onset of 301 conditions). In addition to its superior prediction power, BEHRT provides a personalised view of disease trajectories through its attention mechanism; its flexible architecture enables it to incorporate multiple heterogeneous concepts (e.g., diagnosis, medication, measurements, and more) to improve the accuracy of its predictions; and its (pre-)training results in disease and patient representations that can help us get a step closer to interpretable predictions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BEHRT, a transformer-based sequence model for electronic health records (EHR) that performs multitask prediction of disease onset for 301 conditions. Trained on records from nearly 1.6 million patients, it reports an absolute improvement of 8.0-10.8% in Average Precision Score over prior deep EHR models (DeepCare, RETAIN, etc.), while also enabling interpretable personalized disease trajectories via attention and supporting heterogeneous input types through pre-training.
Significance. If the performance gains can be isolated to the architecture and pre-training, the result would be significant for scaling transformer models to large-scale longitudinal EHR data and for multitask clinical prediction. The scale of the cohort and the attention-based trajectory mapping are positive features; however, the absence of controlled baseline re-implementations reduces the strength of the central empirical claim.
major comments (2)
- [§4] §4 (Experiments) and Appendix A: The manuscript describes BEHRT's cohort construction, input representation, and visit aggregation but provides no side-by-side specification of the diagnosis/medication vocabularies, censoring windows, or train/validation/test partitioning applied when re-implementing the baselines (DeepCare, RETAIN, etc.). Without this, the 8.0-10.8% APS improvement cannot be attributed to the transformer architecture rather than differences in data handling.
- [§4] §4: No statistical testing, confidence intervals, or multiple-run variance is reported for the APS differences across the 301 conditions. This is required to establish that the reported gains are robust rather than artifacts of a single split or random seed.
minor comments (2)
- [Abstract] The abstract states the APS improvement but does not define the exact evaluation protocol (e.g., time-to-event window, positive/negative class construction); this detail should appear in the main text or a dedicated evaluation subsection.
- [§3] Notation for the multi-concept embedding (diagnosis, medication, measurements) in §3 is introduced descriptively; an explicit equation or diagram would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the empirical claims.
read point-by-point responses
-
Referee: [§4] §4 (Experiments) and Appendix A: The manuscript describes BEHRT's cohort construction, input representation, and visit aggregation but provides no side-by-side specification of the diagnosis/medication vocabularies, censoring windows, or train/validation/test partitioning applied when re-implementing the baselines (DeepCare, RETAIN, etc.). Without this, the 8.0-10.8% APS improvement cannot be attributed to the transformer architecture rather than differences in data handling.
Authors: We agree that the absence of explicit side-by-side specifications weakens the ability to isolate architectural contributions. The baselines were re-implemented on the identical 1.6M-patient cohort with the same visit aggregation and censoring logic as BEHRT, but the manuscript does not document the exact vocabulary mappings or split indices used for each baseline. In the revision we will add a comparative table in Appendix A listing vocabulary sizes, censoring windows, and train/validation/test partitioning for BEHRT and all re-implemented baselines. revision: yes
-
Referee: [§4] §4: No statistical testing, confidence intervals, or multiple-run variance is reported for the APS differences across the 301 conditions. This is required to establish that the reported gains are robust rather than artifacts of a single split or random seed.
Authors: The single split was chosen to preserve maximum training data for the 301-task multitask setting on a large cohort. We acknowledge that variance and significance testing are needed. In the revision we will report APS means and standard deviations over five independent runs with different random seeds, include 95% confidence intervals, and add paired statistical tests (e.g., Wilcoxon signed-rank) between BEHRT and each baseline across the 301 conditions. revision: yes
Circularity Check
No circularity: empirical performance claims with no derivation chain
full rationale
The paper introduces BEHRT as a transformer-based model and reports empirical APS improvements on a large EHR cohort. No equations, parameter fits, or derivation steps are present that could reduce to self-defined inputs. The performance comparison is an external benchmark result rather than a constructed prediction; no self-citation load-bearing, ansatz smuggling, or uniqueness theorems appear in the abstract or described content. The central claim remains an independent experimental outcome.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The EHR dataset of nearly 1.6 million individuals is representative and free of major selection or recording biases for the 301 conditions studied.
Reference graph
Works this paper leans on
-
[1]
Diego Ardila, Atilla P Kiraly, Sujeeth Bharadwaj, Bokyung Choi, Joshua J Reicher, Lily Peng, Daniel Tse, Mozziyar Etemadi, Wenxing Ye, Greg Corrado, and David P Naidich. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature Medicine, 25(June), 2019
work page 2019
-
[2]
Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning
Ryan Poplin, Avinash V Varadarajan, Katy Blumer, Yun Liu, Michael V McConnell, Greg S Corrado, Lily Peng, and Dale R Webster. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering, 2(3):158–164, 2018
work page 2018
-
[3]
human and artificial intelligence
Eric J Topol. human and artificial intelligence. Nature Medicine, 25(January), 2019
work page 2019
-
[4]
A guide to deep learning in healthcare
Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, V olodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare. Nature medicine, 25(1):24–29, 2019
work page 2019
-
[5]
Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, Bette Liu, Paul Matthews, Giok Ong, Jill Pell, Alan Silman, Alan Young, Tim Sprosen, Tim Peakman, and Rory Collins. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of...
work page 2015
-
[6]
Benjamin Shickel, Patrick Tighe, Azra Bihorac, and Parisa Rashidi. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE Journal of Biomedical and Health Informatics, 22(5):1589–1604, 2018
work page 2018
-
[7]
Electronic Public Health Reporting
O N C Annual Meeting. Electronic Public Health Reporting. None, 2018. Available at: https://www.healthit. gov/sites/default/files/2018-12/ElectronicPublicHealthReporting.pdf
work page 2018
-
[8]
Hospitals’ Use of Electronic Health Records Data, 2015-2017
Sonal Parasrampuria and Jawanna Henry. Hospitals’ Use of Electronic Health Records Data, 2015-2017. ONC Data Brief, No. 46, 2019
work page 2015
-
[9]
Fatemeh Rahimian, Gholamreza Salimi-Khorshidi, Amir H Payberah, Jenny Tran, Roberto Ayala Solares, Francesca Raimondi, Milad Nazarzadeh, Dexter Canoy, and Kazem Rahimi. Predicting the risk of emergency admission with machine learning: Development and validation using linked electronic health records. PLoS Medicine, 15(11):1–18, 2018
work page 2018
-
[10]
Deep learning for healthcare decision making with EMRs
Znaonui Liang, Gang Zhang, Jimmy Xiangji Huang, and Qmming Vivian Hu. Deep learning for healthcare decision making with EMRs. Proceedings - 2014 IEEE International Conference on Bioinformatics and Biomedicine, IEEE BIBM 2014, pages 556–559, 2014
work page 2014
-
[11]
Truyen Tran, Tu Dinh Nguyen, Dinh Phung, and Svetha Venkatesh. Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (eNRBM).Journal of Biomedical Informatics, 2015
work page 2015
-
[12]
Riccardo Miotto, Li Li, Brian A Kidd, and Joel T Dudley. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Scientific Reports, 6(May):1–10, 2016
work page 2016
-
[13]
Deepr: A Convolutional Net for Medical Records
Phuoc Nguyen, Truyen Tran, Nilmini Wickramasinghe, and Svetha Venkatesh. Deepr: A Convolutional Net for Medical Records. IEEE Journal of Biomedical and Health Informatics, 21(1):22–30, may 2017
work page 2017
-
[14]
Doctor AI: Predicting Clinical Events via Recurrent Neural Networks
Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F Stewart, and Jimeng Sun. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. JMLR workshop and conference proceedings, 56:301–318, 2016
work page 2016
-
[15]
DeepCare: A deep dynamic memory model for predictive medicine
Trang Pham, Truyen Tran, Dinh Phung, and Svetha Venkatesh. DeepCare: A deep dynamic memory model for predictive medicine. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9652 LNAI(i):30–41, 2016
work page 2016
-
[16]
Edward Choi, Mohammad Taha Bahadori, Joshua A. Kulas, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. arxiv, 2016
work page 2016
-
[17]
Jose Roberto Ayala Solares, Francesca Elisa Diletta Raimondi, Yajie Zhu, Fatemeh Rahimian, Dexter Canoy, Jenny Tran, Ana Catarina Pinho Gomes, Amir Payberah, Mariagrazia Zottoli, Milad Nazarzadeh, Nathalie Conrad, Kazem Rahimi, and Gholamreza Salimi-Khorshidi. Deep Learning for Electronic Health Records: A Comparative Review of Multiple Deep Neural Archit...
work page 2019
-
[18]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arxiv, 2018
work page 2018
-
[19]
Data Resource Profile: Clinical Practice Research Datalink (CPRD)
Emily Herrett, Arlene M Gallagher, Krishnan Bhaskaran, Harriet Forbes, Rohini Mathur, Tjeerd Van Staa, and Liam Smeeth. Data Resource Profile: Clinical Practice Research Datalink (CPRD). International Journal of Epidemiology, 44(3):827–836, 2015. 11 A PREPRINT - JULY 24, 2019
work page 2015
-
[20]
The uk general practice research database
T Walley and A Mantgani. The uk general practice research database. The Lancet, 350(9084):1097 – 1099, 1997
work page 1997
-
[21]
Connor A Emdin, Simon G Anderson, Thomas Callender, Nathalie Conrad, Gholamreza Salimi-Khorshidi, Hamid Mohseni, Mark Woodward, and Kazem Rahimi. Usual blood pressure, peripheral arterial disease, and vascular risk: Cohort study of 4.2 million adults. BMJ (Online), 2015
work page 2015
-
[22]
Connor A. Emdin, Simon G. Anderson, Gholamreza Salimi-Khorshidi, Mark Woodward, Stephen MacMahon, Terrence Dwyer, and Kazem Rahimi. Usual blood pressure, atrial fibrillation and vascular risk: Evidence from 4.3 million adults. International Journal of Epidemiology, 2017
work page 2017
-
[23]
F. Lee, H. R.S. Patel, and M. Emberton. The ’top 10’ urological procedures: A study of hospital episodes statistics 1998-99. BJU International, 2002
work page 1998
-
[24]
Hamid Mohseni, Amit Kiran, Reza Khorshidi, and Kazem Rahimi. Influenza vaccination and risk of hospitalization in patients with heart failure: A self-controlled case series study. European Heart Journal, 2017
work page 2017
-
[25]
NHS. Read Codes. Available at: https://digital.nhs.uk/services/ terminology-and-classifications/read-codes
-
[26]
WHO. ICD-10 online versions. Available at https://icd.who.int/browse10/2016/e
work page 2016
-
[27]
Valerie Kuan, Spiros Denaxas, Arturo Gonzalez-izquierdo, Kenan Direk, Osman Bhatti, Shanaz Husain, Shailen Sutaria, Melanie Hingorani, Dorothea Nitsch, Constantinos A Parisinos, R Thomas Lumbers, Rohini Mathur, Reecha Sofat, Juan P Casas, Ian C K Wong, and Harry Hemingway. Articles A chronological map of 308 physical and mental health conditions from 4 mi...
work page 2019
-
[28]
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Transla- tion
Kyunghyun Cho. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Transla- tion. arxiv, 2013
work page 2013
-
[29]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762 [cs], apr 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[30]
On the difficulty of training Recurrent Neural Networks
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training Recurrent Neural Networks. arxiv, 2012
work page 2012
-
[31]
Multimorbidity: a priority for global health research
The Academy of Medical Sciences. Multimorbidity: a priority for global health research. The Academy of Medical Sciences, pages 1–127, 2018
work page 2018
-
[32]
Evaluation : From Precision , Recall and F-Factor to ROC , Informedness , Markedness & Correlation
David M W Powers. Evaluation : From Precision , Recall and F-Factor to ROC , Informedness , Markedness & Correlation. arxiv, 2007
work page 2007
-
[33]
An introduction to ROC analysis
Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 2006
work page 2006
-
[34]
Recall, precision and average precision
Mu Zhu. Recall, precision and average precision. Department of Statistics and Actuarial Science, . . ., 2004
work page 2004
-
[35]
Practical Bayesian Optimization of Machine Learning Algorithms
Ryan Snoek, Jasper; Larochelle, Hugo; Adams. Practical Bayesian Optimization of Machine Learning Algorithms. NIPS, 2(12):e540, 2017
work page 2017
-
[36]
Evaluating Word Embedding Models : Methods and Experimental Results
Bin Wang, Student Member, Angela Wang, Fenxiao Chen, Student Member, Yuncheng Wang, and C Jay Kuo. Evaluating Word Embedding Models : Methods and Experimental Results. arxiv, pages 1–13, 2019
work page 2019
-
[37]
Laurens Van Der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. JMLR, 9:2579–2605, 2008
work page 2008
-
[38]
Visualizing Attention in Transformer-Based Language Representation Models
Jesse Vig. Visualizing Attention in Transformer-Based Language Representation Models. arxiv, pages 2–7, 2019. 12 A PREPRINT - JULY 24, 2019 A Hyperparameter Tuning We show the hyperparameter tuning results here in the following section. In Table 2, we show the results of the MLM training hyperparameter tuning process. We performed Bayesian Optimization to...
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.