A Multimodal and Explainable Machine Learning Approach to Diagnosing Multi-Class Ejection Fraction from Electrocardiograms
Pith reviewed 2026-05-10 09:34 UTC · model grok-4.3
The pith
A multimodal model using ECG timeseries features and EHR data classifies left ventricular ejection fraction into four clinical categories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A multimodal XGBoost model trained on 36,784 ECG-echocardiogram pairs from 30,952 outpatients achieves one-vs-rest AUROCs of 0.95 for severe reduction, 0.92 for moderate, 0.82 for mild, and 0.91 for normal ejection fraction, outperforming single-modality baselines while maintaining performance under temporal validation on 19,966 later ECGs; SHAP attributions identify the most influential ECG and EHR features for each classification.
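For readers less familiar with the headline metric: a one-vs-rest AUROC treats each LVEF stratum in turn as the positive class and all other strata as negatives, then scores how well the predicted probability for that stratum ranks positives above negatives. The following is a minimal pure-Python sketch of that calculation using the rank-based (Mann-Whitney) form. The function names `binary_auroc` and `one_vs_rest_auroc` are illustrative and do not come from the paper, which does not say how its AUROCs were computed.

```python
from typing import Dict, List, Sequence


def binary_auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC from the Mann-Whitney U statistic; tied scores share an average rank."""
    n = len(labels)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j to the end of the current group of tied scores.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


def one_vs_rest_auroc(
    y_true: Sequence[str],
    proba: Sequence[Sequence[float]],
    classes: Sequence[str],
) -> Dict[str, float]:
    """One AUROC per class: that class is positive, every other class is negative."""
    result: Dict[str, float] = {}
    for col, cls in enumerate(classes):
        labels: List[int] = [1 if y == cls else 0 for y in y_true]
        scores = [row[col] for row in proba]
        result[cls] = binary_auroc(labels, scores)
    return result
```

Here `proba` is assumed to hold one row per ECG and one column per class, in the same order as `classes`. Because each stratum is scored against the pooled remainder, a class that sits between two others (such as mild reduction) can be harder to separate than the extremes.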
What carries the argument
The multimodal XGBoost classifier that integrates engineered 12-lead ECG timeseries features with structured EHR variables, using SHAP attributions to explain feature contributions.
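To make the attribution step concrete, the sketch below computes exact Shapley values for a toy function by enumerating every feature coalition. It illustrates the quantity that SHAP approximates and is not the authors' pipeline: the paper does not state which SHAP explainer was used, and the single-baseline value function, the function name `exact_shapley`, and the toy model are all assumptions made for this example.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def exact_shapley(
    f: Callable[[List[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values of f at x, relative to a single baseline input.

    A coalition's value is f evaluated with coalition features taken from x
    and all remaining features taken from the baseline (an assumption of this
    sketch; SHAP typically averages over a background dataset instead).
    """
    n = len(x)

    def value(coalition: frozenset) -> float:
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(v: List[float]) -> float:
    """Hypothetical model: one additive term plus one pairwise interaction."""
    return 2 * v[0] + v[1] * v[2]
```

Enumeration costs grow exponentially with the number of features, so this form is only practical for a handful of inputs. The attributions sum to `f(x) - f(baseline)`, which is the property that lets per-feature contributions be read as shares of a single prediction.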
Load-bearing premise
Data drawn retrospectively from outpatients at a single U.S. health system represent the populations and clinical workflows where the model would be applied.
What would settle it
Testing the trained model on ECG and EHR pairs collected from a different health system or region and finding substantially lower AUROCs would show the performance does not generalize.
Original abstract
Left ventricular ejection fraction (LVEF) assessment depends on echocardiography, limiting access in primary care and resource-constrained settings. We developed a multimodal machine-learning framework that combines engineered 12-lead ECG timeseries features with structured EHR variables to classify LVEF into four clinically used strata: normal (>50%), mildly reduced (40-50%), moderately reduced (30-40%), and severely reduced (<30%). To support model explainability, we identified the most influential ECG and EHR features via SHAP attributions. Using retrospective data from Hartford HealthCare, we trained XGBoost models on 36,784 ECG-echocardiogram pairs from 30,952 outpatients and evaluated temporal generalizability on 19,966 ECGs from a subsequent period. The multimodal model achieved one-vs-rest AUROCs of 0.95 (severe), 0.92 (moderate), 0.82 (mild), and 0.91 (normal), outperforming ECG-only and EHR-only baselines, and maintained performance under temporal validation. This work supports ECG-based, multimodal LVEF stratification as a practical screening and triage aid to prioritize confirmatory imaging where resources are limited.
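The four strata in the abstract can be written as a simple labelling rule. The abstract's ranges share their endpoints (40% and 30% each appear in two ranges), so the sketch below has to pick a convention: it assumes a value exactly at 40 is labelled mild and a value exactly at 30 is labelled moderate. That boundary handling and the function name `lvef_stratum` are assumptions for illustration, not something the abstract specifies.

```python
def lvef_stratum(lvef_percent: float) -> str:
    """Map an echocardiographic LVEF (in percent) to one of the four strata.

    Boundary convention is an assumption: lower bounds are inclusive,
    so 40.0 -> "mild" and 30.0 -> "moderate".
    """
    if not 0 <= lvef_percent <= 100:
        raise ValueError("LVEF must be a percentage between 0 and 100")
    if lvef_percent > 50:
        return "normal"
    if lvef_percent >= 40:
        return "mild"
    if lvef_percent >= 30:
        return "moderate"
    return "severe"
```

Whichever convention is used, it needs to be fixed before training, because echo reports often round LVEF to multiples of 5 and many measurements will therefore land exactly on a threshold.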
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a multimodal XGBoost model integrating engineered 12-lead ECG time-series features with structured EHR variables to classify LVEF into four strata (normal >50%, mild 40-50%, moderate 30-40%, severe <30%). Trained on 36,784 retrospective ECG-echocardiogram pairs from Hartford HealthCare outpatients, it reports one-vs-rest AUROCs of 0.95 (severe), 0.92 (moderate), 0.82 (mild), and 0.91 (normal), outperforming ECG-only and EHR-only baselines, with maintained performance on temporal hold-out validation using 19,966 later ECGs; SHAP is applied for feature attribution and explainability.
Significance. If the performance holds under broader testing, the work could meaningfully advance ECG-based LVEF screening to triage echocardiography in primary care and resource-limited environments. Strengths include the multimodal design, explicit baseline comparisons, temporal validation, and SHAP-based explainability, which together provide a concrete, falsifiable performance benchmark on real patient data.
major comments (2)
- [Methods] Methods section (data collection and preprocessing): insufficient detail is provided on ECG time-series feature engineering, missing-data imputation or exclusion rules for EHR variables, and any post-hoc calibration of the XGBoost probability outputs. These omissions are load-bearing for interpreting the reported AUROCs and for reproducibility.
- [Results] Results (evaluation and generalizability): performance is assessed only via temporal split within a single U.S. health system (Hartford HealthCare). This tests time shift but leaves unexamined site-specific factors (demographics, ECG hardware/lead placement, EHR coding practices, prevalence) that would affect deployment claims for practical screening.
minor comments (1)
- [Abstract] Abstract: the sentence on 'maintained performance under temporal validation' would benefit from a brief parenthetical note on the exact AUROC values or degradation observed in the hold-out set for context.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and transparency where feasible.
Point-by-point responses
- Referee: [Methods] Methods section (data collection and preprocessing): insufficient detail is provided on ECG time-series feature engineering, missing-data imputation or exclusion rules for EHR variables, and any post-hoc calibration of the XGBoost probability outputs. These omissions are load-bearing for interpreting the reported AUROCs and for reproducibility.
  Authors: We agree that greater methodological detail is required for reproducibility. In the revised manuscript we have added a dedicated subsection under Methods that specifies the ECG time-series features extracted (QRS duration, QTc, T-wave amplitude and slope metrics, R-R interval variability, and selected frequency-domain measures), the precise exclusion rules applied to EHR variables (any variable with >20% missingness was dropped; the remainder were imputed using median for continuous features and mode for categorical features), and a statement that no post-hoc calibration was performed on the XGBoost probability outputs because the model is an uncalibrated ensemble of trees. These additions directly address the concerns raised.
  Revision: yes
- Referee: [Results] Results (evaluation and generalizability): performance is assessed only via temporal split within a single U.S. health system (Hartford HealthCare). This tests time shift but leaves unexamined site-specific factors (demographics, ECG hardware/lead placement, EHR coding practices, prevalence) that would affect deployment claims for practical screening.
  Authors: We acknowledge that evaluation within a single health system, even with temporal validation, does not fully address site-specific variability. The temporal hold-out tests robustness to changes in patient mix and practice patterns over time, which is relevant for prospective use. We have expanded the Discussion to explicitly list the unexamined factors (demographics, hardware differences, coding practices, and prevalence shifts) as limitations and to describe the need for future multi-center validation. We do not claim the current results generalize beyond the studied population.
  Revision: partial
- We do not have access to data from additional health systems and therefore cannot perform external multi-site validation at this time.
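The EHR preprocessing rules quoted in the first response (drop any variable with more than 20% missingness, then impute the median for continuous features and the mode for categorical ones) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function names, the use of `None` for missing values, and the choice to learn fill values on the training table and reuse them on later data are all assumptions of this example.

```python
from statistics import median, mode
from typing import Any, Dict, List, Set

MAX_MISSING = 0.20  # threshold quoted in the rebuttal


def fit_imputer(train: Dict[str, List[Any]], categorical: Set[str]) -> Dict[str, Any]:
    """Return one fill value per column that survives the missingness rule."""
    fills: Dict[str, Any] = {}
    for name, values in train.items():
        if not values:
            continue  # nothing to learn from an empty column
        observed = [v for v in values if v is not None]
        missing_fraction = (len(values) - len(observed)) / len(values)
        if missing_fraction > MAX_MISSING:
            continue  # column is dropped entirely
        fills[name] = mode(observed) if name in categorical else median(observed)
    return fills


def apply_imputer(table: Dict[str, List[Any]], fills: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Keep only the retained columns and replace None with the learned fill."""
    return {
        name: [fill if v is None else v for v in table[name]]
        for name, fill in fills.items()
    }
```

Separating `fit_imputer` from `apply_imputer` means the temporal hold-out can be filled with statistics learned from the training period only. Whether the paper did this is not stated in the rebuttal.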
Circularity Check
No circularity in empirical ML performance reporting on temporal data split
Full rationale
The paper trains XGBoost classifiers on 36,784 retrospective ECG-echocardiogram pairs and reports one-vs-rest AUROCs on a later temporal hold-out of 19,966 ECGs from the same system. No equations, derivations, or parameter-fitting steps are present; performance is measured directly against external (time-shifted) patient data rather than being derived from or forced by the training inputs themselves. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing support for the reported metrics. The evaluation is self-contained against real-world benchmarks and does not reduce to a renaming or self-definition of the input data.
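The temporal hold-out described above amounts to splitting records at a calendar cutoff. The sketch below shows that split and adds a check for patients who appear on both sides of the cutoff, which matters because the training set has more ECG-echocardiogram pairs than patients. The record fields, the function names, and the overlap check are assumptions for illustration. The page does not say whether the paper excluded or tracked repeat patients.

```python
from datetime import date
from typing import List, NamedTuple, Sequence, Set, Tuple


class EcgRecord(NamedTuple):
    """Hypothetical minimal record: one ECG paired with an echo-derived label."""
    patient_id: str
    ecg_date: date
    stratum: str


def temporal_split(
    records: Sequence[EcgRecord], cutoff: date
) -> Tuple[List[EcgRecord], List[EcgRecord]]:
    """Records before the cutoff form the development set; the rest are held out."""
    development = [r for r in records if r.ecg_date < cutoff]
    holdout = [r for r in records if r.ecg_date >= cutoff]
    return development, holdout


def shared_patients(
    development: Sequence[EcgRecord], holdout: Sequence[EcgRecord]
) -> Set[str]:
    """Patient IDs present on both sides of the cutoff."""
    return {r.patient_id for r in development} & {r.patient_id for r in holdout}
```

If `shared_patients` returns a non-empty set, hold-out performance partly reflects patients the model has already seen, so reporting results with and without those patients would make the temporal claim easier to interpret.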
Axiom & Free-Parameter Ledger
free parameters (1)
- XGBoost hyperparameters
axioms (1)
- domain assumption: ECG timeseries features plus structured EHR variables are predictive of echocardiogram-derived LVEF strata
Reference graph
Works this paper leans on
- [1]
- [2] World Health Organization. Cardiovascular diseases (CVDs) fact sheet. https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
- [3] Attia, Z. I. et al. Screening for cardiac contractile dysfunction using an artificial intelligence-enabled electrocardiogram. Nat. Med. 25, 70–74 (2019)
- [4] Kalmady, S. V. et al. A multitask deep learning model utilizing electrocardiograms for major cardiovascular adverse events prediction. npj Digit. Med. 8, 24 (2025)
- [5] Lee, H. J. et al. Artificial intelligence-enabled ECG for left ventricular diastolic function and filling pressure. npj Digit. Med. 7, 133 (2024)
- [6] Kim, M. et al. Prediction of left ventricular ejection fraction changes in heart failure patients using machine learning and electronic health records: A multi-site study. npj Digit. Med. 6, 149 (2023)
- [7]
- [8] Soenksen, L. R. et al. Integrated multimodal artificial intelligence framework for healthcare applications. npj Digit. Med. 5, 149 (2022)
- [9] Bertsimas, D. et al. Machine learning for real-time heart disease prediction. IEEE J. Biomed. Health Inform. 25, 3627–3637 (2021)
- [10] Christ, M. et al. Time series feature extraction on basis of scalable hypothesis tests (tsfresh – A Python package). Neurocomputing 307, 72–77 (2018)
- [11] de Hond, A. A. H. et al. Perspectives on validation of clinical predictive algorithms. npj Digit. Med. 6, 86 (2023)
- [12] Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30, 4765–4774 (2017)
Tables & Figures
Table 1: AUC performance of our XGBoost classifier on the internal hold-out test set, with bootstrapped 95% confidence intervals in brackets, demonstrating the benefit of including multiple data modaliti...
- [13] High-impact clinical markers (history & vitals)
  Several EHR variables act as strong correlates of whether the model assigns a patient to Normal LVEF.
  • Presence of cardiomyopathy (ICD-10 I42.9) history contributes away from “Normal LVEF” (negative SHAP), consistent with cardiomyopathy being a high-risk substrate for reduced systolic function.
  • Ischemic c...
- [14] ECG voltage summaries and rhythm-related morphology
  A large fraction of influential ECG predictors are voltage/amplitude summaries, especially in lateral or limb leads.
  • Lead I QR-interval amplitude statistics (average and median): These are among the top ECG predictors for the Normal score. The beeswarm suggests that certain ranges of Lead I QR amplitude...
- [15] Signal complexity / frequency-domain features
  • Spectral entropy (Lead V6): Frequency-domain complexity is among the top contributors for the Normal class. Higher entropy (a broader, more complex frequency distribution) tends to shift the “Normal LVEF” score in a consistent direction in this cohort. This likely reflects that certain abnormal morphologies ...
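Since spectral entropy is named above as a top contributor, a short sketch of one common definition may help: the Shannon entropy of the normalized power spectrum, optionally divided by its maximum so the result lies between 0 and 1. The excerpt does not say which definition, windowing, or library the paper used, so the function names, the mean removal, and the naive DFT below are all assumptions chosen to keep the example dependency-free.

```python
import cmath
import math
from typing import List, Sequence


def power_spectrum(signal: Sequence[float]) -> List[float]:
    """Power at positive frequencies 1..n//2 via a naive DFT (O(n^2))."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC component
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            centered[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def spectral_entropy(signal: Sequence[float], normalize: bool = True) -> float:
    """Shannon entropy (bits) of the power spectrum treated as a distribution."""
    power = power_spectrum(signal)
    total = sum(power)
    if total == 0:
        return 0.0  # flat signal: no spectral content to distribute
    probs = [p / total for p in power if p > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    if normalize and len(power) > 1:
        entropy /= math.log2(len(power))  # scale to the range [0, 1]
    return entropy
```

Under this definition a single clean sinusoid scores near 0 and a signal with power spread evenly across frequencies scores near 1, which matches the excerpt's reading of higher entropy as "a broader, more complex frequency distribution". A real ECG pipeline would more likely use an FFT and a windowed spectral estimate.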