A Multimodal and Explainable Machine Learning Approach to Diagnosing Multi-Class Ejection Fraction from Electrocardiograms

Catherine Ning; Cindy Beini Wang; Dimitris Bertsimas; Joseph Radojevic; Sean McMahon; Steven Zweibel; Yu Ma

arxiv: 2604.25942 · v1 · submitted 2026-04-17 · 💻 cs.LG

Pith reviewed 2026-05-10 09:34 UTC · model grok-4.3

classification 💻 cs.LG

keywords ejection fraction · electrocardiogram · machine learning · multimodal · explainable AI · XGBoost · SHAP · heart failure

The pith

A multimodal model using ECG time-series features and EHR data classifies left ventricular ejection fraction into four clinical categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a machine learning approach that fuses engineered features from 12-lead ECG recordings with structured variables from electronic health records to assign patients to one of four ejection fraction groups: normal, mildly reduced, moderately reduced, or severely reduced. This matters because echocardiography, the standard way to measure ejection fraction, is not always available in primary care or in settings with limited imaging resources, so an ECG-based screen could help decide who needs an echo. The authors train an XGBoost model on 36,784 paired ECG-echocardiogram records, show that it beats ECG-only and EHR-only versions, report that performance holds on later data from the same health system, and use SHAP values to highlight which features matter most for each prediction.

Core claim

A multimodal XGBoost model trained on 36,784 ECG-echocardiogram pairs from 30,952 outpatients achieves one-vs-rest AUROCs of 0.95 for severe reduction, 0.92 for moderate, 0.82 for mild, and 0.91 for normal ejection fraction, outperforming single-modality baselines while maintaining performance under temporal validation on 19,966 later ECGs; SHAP attributions identify the most influential ECG and EHR features for each classification.
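To make the "one-vs-rest AUROC" metric concrete, here is a minimal sketch in plain Python. It scores each class against all others using the pairwise-ranking (Mann-Whitney) definition of AUROC. The function names, the class ordering, and the toy probabilities are illustrative assumptions; they are not the paper's code or data.

```python
from typing import Dict, List, Sequence


def binary_auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auroc(
    y_true: Sequence[str],
    proba: Sequence[Sequence[float]],
    classes: Sequence[str],
) -> Dict[str, float]:
    """Treat each class in turn as 'positive' and all other classes as 'negative'."""
    result = {}
    for k, name in enumerate(classes):
        binary = [1 if y == name else 0 for y in y_true]
        column = [row[k] for row in proba]  # predicted probability of class k
        result[name] = binary_auroc(binary, column)
    return result


# Toy example (made-up numbers, NOT results from the paper).
CLASSES: List[str] = ["normal", "mild", "moderate", "severe"]
y_true = ["severe", "moderate", "mild", "normal", "normal", "mild"]
proba = [
    [0.05, 0.10, 0.15, 0.70],
    [0.10, 0.20, 0.60, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.80, 0.10, 0.05, 0.05],
    [0.30, 0.40, 0.20, 0.10],
    [0.45, 0.35, 0.10, 0.10],
]
aurocs = one_vs_rest_auroc(y_true, proba, CLASSES)
```

Because each class gets its own binary comparison, a model can rank one stratum well and another poorly, which is consistent with the reported spread between 0.95 (severe) and 0.82 (mild).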

What carries the argument

A multimodal XGBoost classifier that integrates engineered 12-lead ECG time-series features with structured EHR variables, with SHAP attributions used to explain each feature's contribution.
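SHAP attributions are built on Shapley values, so a small worked example may help show what an attribution means. The sketch below computes exact Shapley values for a tiny function by enumerating feature subsets, with "absent" features replaced by a single baseline value. That baseline choice, the function name `exact_shapley`, and the toy model are assumptions for illustration; the paper's actual SHAP configuration is not described on this page, and practical tree-model explainers use much faster algorithms than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def exact_shapley(
    f: Callable[[List[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values of f at x, relative to a single baseline point.

    A feature outside the coalition takes its baseline value (an assumption;
    SHAP implementations usually average over a background dataset instead).
    Cost grows as 2^n, so this is only practical for a handful of features.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                without_i = [x[j] if j in present else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi


def toy_score(v: List[float]) -> float:
    """Hypothetical two-feature score with an interaction term."""
    return 2.0 * v[0] + 3.0 * v[1] + v[0] * v[1]


phi = exact_shapley(toy_score, x=[1.0, 1.0], baseline=[0.0, 0.0])
```

The attributions always sum to the difference between the prediction at `x` and the prediction at the baseline, which is the property that lets per-feature contributions be read as a decomposition of a single prediction.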

Load-bearing premise

Data drawn retrospectively from outpatients at a single U.S. health system represent the populations and clinical workflows where the model would be applied.

What would settle it

Testing the trained model on ECG and EHR pairs collected from a different health system or region and finding substantially lower AUROCs would show the performance does not generalize.

Original abstract

Left ventricular ejection fraction (LVEF) assessment depends on echocardiography, limiting access in primary care and resource-constrained settings. We developed a multimodal machine-learning framework that combines engineered 12-lead ECG timeseries features with structured EHR variables to classify LVEF into four clinically used strata: normal (>50%), mildly reduced (40-50%), moderately reduced (30-40%), and severely reduced (<30%). To support model explainability, we identified the most influential ECG and EHR features via SHAP attributions. Using retrospective data from Hartford HealthCare, we trained XGBoost models on 36,784 ECG-echocardiogram pairs from 30,952 outpatients and evaluated temporal generalizability on 19,966 ECGs from a subsequent period. The multimodal model achieved one-vs-rest AUROCs of 0.95 (severe), 0.92 (moderate), 0.82 (mild), and 0.91 (normal), outperforming ECG-only and EHR-only baselines, and maintained performance under temporal validation. This work supports ECG-based, multimodal LVEF stratification as a practical screening and triage aid to prioritize confirmatory imaging where resources are limited.
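The four strata in the abstract can be sketched as a simple labeling function. Note that the abstract's ranges share their endpoints (40% and 50% each appear in two adjacent ranges, and 30% is not assigned explicitly), so the boundary handling below is an assumption: exactly 50% and 40% are placed in "mildly reduced" and exactly 30% in "moderately reduced". The function name is also hypothetical.

```python
def lvef_stratum(lvef_percent: float) -> str:
    """Map an echocardiogram LVEF (in percent) to one of four strata.

    Thresholds follow the abstract: normal (>50), mild (40-50),
    moderate (30-40), severe (<30). Boundary assignment is an ASSUMPTION,
    since the abstract does not say which side 30, 40, and 50 fall on.
    """
    if not 0.0 <= lvef_percent <= 100.0:
        raise ValueError("LVEF must be a percentage between 0 and 100")
    if lvef_percent > 50.0:
        return "normal"
    if lvef_percent >= 40.0:
        return "mildly reduced"
    if lvef_percent >= 30.0:
        return "moderately reduced"
    return "severely reduced"


labels = [lvef_stratum(v) for v in (62.0, 45.0, 35.0, 22.0)]
```

Whatever convention is used, it has to match between label construction and any downstream clinical rule, because patients sitting exactly on a threshold would otherwise change class.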

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a competent single-center multimodal XGBoost study for four-class LVEF classification from ECG plus EHR data that holds up under temporal validation but does not test external sites.

The main takeaway is that the authors built an XGBoost model fusing engineered 12-lead ECG features with structured EHR variables to classify LVEF into normal, mild, moderate, and severe reduction. On a large retrospective set from Hartford HealthCare it reaches one-vs-rest AUROCs of 0.91 (normal), 0.82 (mild), 0.92 (moderate), and 0.95 (severe), beats the single-modality baselines, and keeps performance on a later temporal hold-out set. SHAP attributions are used to surface influential features for explainability.

Referee Report

2 major / 1 minor

Summary. The paper develops a multimodal XGBoost model integrating engineered 12-lead ECG time-series features with structured EHR variables to classify LVEF into four strata (normal >50%, mild 40-50%, moderate 30-40%, severe <30%). Trained on 36,784 retrospective ECG-echocardiogram pairs from Hartford HealthCare outpatients, it reports one-vs-rest AUROCs of 0.95 (severe), 0.92 (moderate), 0.82 (mild), and 0.91 (normal), outperforming ECG-only and EHR-only baselines, with maintained performance on temporal hold-out validation using 19,966 later ECGs; SHAP is applied for feature attribution and explainability.

Significance. If the performance holds under broader testing, the work could meaningfully advance ECG-based LVEF screening to triage echocardiography in primary care and resource-limited environments. Strengths include the multimodal design, explicit baseline comparisons, temporal validation, and SHAP-based explainability, which together provide a concrete, falsifiable performance benchmark on real patient data.

major comments (2)

[Methods] Methods section (data collection and preprocessing): insufficient detail is provided on ECG time-series feature engineering, missing-data imputation or exclusion rules for EHR variables, and any post-hoc calibration of the XGBoost probability outputs. These omissions are load-bearing for interpreting the reported AUROCs and for reproducibility.
[Results] Results (evaluation and generalizability): performance is assessed only via temporal split within a single U.S. health system (Hartford HealthCare). This tests time shift but leaves unexamined site-specific factors (demographics, ECG hardware/lead placement, EHR coding practices, prevalence) that would affect deployment claims for practical screening.

minor comments (1)

[Abstract] Abstract: the sentence on 'maintained performance under temporal validation' would benefit from a brief parenthetical note on the exact AUROC values or degradation observed in the hold-out set for context.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and transparency where feasible.

Referee: [Methods] Methods section (data collection and preprocessing): insufficient detail is provided on ECG time-series feature engineering, missing-data imputation or exclusion rules for EHR variables, and any post-hoc calibration of the XGBoost probability outputs. These omissions are load-bearing for interpreting the reported AUROCs and for reproducibility.

Authors: We agree that greater methodological detail is required for reproducibility. In the revised manuscript we have added a dedicated subsection under Methods that specifies the ECG time-series features extracted (QRS duration, QTc, T-wave amplitude and slope metrics, R-R interval variability, and selected frequency-domain measures), the precise exclusion rules applied to EHR variables (any variable with >20% missingness was dropped; the remainder were imputed using median for continuous features and mode for categorical features), and a statement that no post-hoc calibration was performed on the XGBoost probability outputs because the model is an uncalibrated ensemble of trees. These additions directly address the concerns raised.
Revision: yes

Referee: [Results] Results (evaluation and generalizability): performance is assessed only via temporal split within a single U.S. health system (Hartford HealthCare). This tests time shift but leaves unexamined site-specific factors (demographics, ECG hardware/lead placement, EHR coding practices, prevalence) that would affect deployment claims for practical screening.

Authors: We acknowledge that evaluation within a single health system, even with temporal validation, does not fully address site-specific variability. The temporal hold-out tests robustness to changes in patient mix and practice patterns over time, which is relevant for prospective use. We have expanded the Discussion to explicitly list the unexamined factors (demographics, hardware differences, coding practices, and prevalence shifts) as limitations and to describe the need for future multi-center validation. We do not claim the current results generalize beyond the studied population.
Revision: partial
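The missing-data rule described in the first simulated response (drop any variable with more than 20% missingness, then impute the median for continuous features and the mode for categorical ones) can be sketched as follows. This is an illustration of the stated rule only: the function name, the column names, the use of `None` as the missing marker, and the toy values are all assumptions, and the response itself is simulated rather than taken from the authors.

```python
from statistics import median, mode
from typing import Dict, List, Optional, Set


def drop_and_impute(
    columns: Dict[str, List[Optional[object]]],
    categorical: Set[str],
    max_missing: float = 0.20,
) -> Dict[str, list]:
    """Drop columns with missingness above max_missing, then fill the rest.

    Continuous columns are filled with the median of observed values,
    categorical columns with the most frequent observed value.
    """
    cleaned = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        if not values or not observed:
            continue  # nothing to learn a fill value from
        missing_fraction = (len(values) - len(observed)) / len(values)
        if missing_fraction > max_missing:
            continue  # variable dropped under the >20% rule
        fill = mode(observed) if name in categorical else median(observed)
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned


# Hypothetical EHR columns; None marks a missing entry.
ehr = {
    "age": [60, None, 70, 80, 90],          # 20% missing: kept, median-filled
    "sex": ["F", "M", None, "F", "F"],      # 20% missing: kept, mode-filled
    "bnp": [None, None, 100, 200, 300],     # 40% missing: dropped
}
cleaned = drop_and_impute(ehr, categorical={"sex"})
```

In a train/test setting the fill values would normally be computed on the training split and reused on held-out data, so that the hold-out set does not influence preprocessing; whether the study did this is not stated on this page.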

standing simulated objections not resolved

We do not have access to data from additional health systems and therefore cannot perform external multi-site validation at this time.

Circularity Check

0 steps flagged

No circularity in empirical ML performance reporting on temporal data split

The paper trains XGBoost classifiers on 36,784 retrospective ECG-echocardiogram pairs and reports one-vs-rest AUROCs on a later temporal hold-out of 19,966 ECGs from the same system. No equations, derivations, or parameter-fitting steps are present; performance is measured directly against external (time-shifted) patient data rather than being derived from or forced by the training inputs themselves. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing support for the reported metrics. The evaluation is self-contained against real-world benchmarks and does not reduce to a renaming or self-definition of the input data.
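A temporal hold-out of the kind described above amounts to splitting records at a cutoff date. The sketch below shows that split, plus a helper that lists patients appearing on both sides of the cutoff, since the reported counts (36,784 pairs from 30,952 outpatients) imply that some patients have more than one ECG. The field names, the cutoff date, and the toy records are hypothetical; the actual dates and any handling of repeat patients are not given on this page.

```python
from datetime import date
from typing import Dict, List, Set, Tuple

Record = Dict[str, object]


def temporal_split(records: List[Record], cutoff: date) -> Tuple[List[Record], List[Record]]:
    """Records dated before the cutoff go to development, the rest to hold-out."""
    development = [r for r in records if r["ecg_date"] < cutoff]
    holdout = [r for r in records if r["ecg_date"] >= cutoff]
    return development, holdout


def shared_patients(development: List[Record], holdout: List[Record]) -> Set[str]:
    """Patient IDs that occur in both periods (a possible source of optimism)."""
    dev_ids = {r["patient_id"] for r in development}
    hold_ids = {r["patient_id"] for r in holdout}
    return dev_ids & hold_ids


# Hypothetical records; the cutoff date is an assumption for illustration.
records = [
    {"patient_id": "p1", "ecg_date": date(2021, 3, 1)},
    {"patient_id": "p2", "ecg_date": date(2021, 9, 15)},
    {"patient_id": "p1", "ecg_date": date(2023, 2, 10)},
    {"patient_id": "p3", "ecg_date": date(2023, 6, 5)},
]
dev, hold = temporal_split(records, cutoff=date(2022, 1, 1))
overlap = shared_patients(dev, hold)
```

A split like this tests drift over time within one institution, which is why the referee's point about site-specific factors still stands after temporal validation.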

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard supervised-learning assumptions plus the domain premise that ECG morphology and basic EHR fields contain sufficient signal for LVEF strata; no new physical entities are postulated.

free parameters (1)

XGBoost hyperparameters
Tuned on the training split to optimize multi-class AUROC; exact values not stated in abstract.

axioms (1)

domain assumption: ECG time-series features plus structured EHR variables are predictive of echocardiogram-derived LVEF strata
Invoked by training the classifier on paired data without further justification in the abstract.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1] Cook, D. A., Oh, S. Y. & Pusic, M. V. Accuracy of physicians' electrocardiogram interpretations: A systematic review and meta-analysis. JAMA Intern. Med. 180, 1461–1471 (2020)

[2] World Health Organization. Cardiovascular diseases (CVDs) fact sheet. https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)

[3] Attia, Z. I. et al. Screening for cardiac contractile dysfunction using an artificial intelligence-enabled electrocardiogram. Nat. Med. 25, 70–74 (2019)

[4] Kalmady, S. V. et al. A multitask deep learning model utilizing electrocardiograms for major cardiovascular adverse events prediction. npj Digit. Med. 8, 24 (2025)

[5] Lee, H. J. et al. Artificial intelligence-enabled ECG for left ventricular diastolic function and filling pressure. npj Digit. Med. 7, 133 (2024)

[6] Kim, M. et al. Prediction of left ventricular ejection fraction changes in heart failure patients using machine learning and electronic health records: A multi-site study. npj Digit. Med. 6, 149 (2023)

[7] Han, Y. et al. AI for regulatory affairs: Balancing accuracy, interpretability, and computational cost in medical device classification. arXiv 2505.18695 (2025)

[8] Soenksen, L. R. et al. Integrated multimodal artificial intelligence framework for healthcare applications. npj Digit. Med. 5, 149 (2022)

[9] Bertsimas, D. et al. Machine learning for real-time heart disease prediction. IEEE J. Biomed. Health Inform. 25, 3627–3637 (2021)

[10] Christ, M. et al. Time series feature extraction on basis of scalable hypothesis tests (tsfresh – A Python package). Neurocomputing 307, 72–77 (2018)

[11] de Hond, A. A. H. et al. Perspectives on validation of clinical predictive algorithms. npj Digit. Med. 6, 86 (2023)

[12] Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30, 4765–4774 (2017).
Tables & Figures. Table 1: AUC performance of our XGBoost classifier on the internal hold-out test set, with bootstrapped 95% confidence intervals in brackets, demonstrating the benefit of including multiple data modaliti...

[13] Normal LVEF. High-impact clinical markers (history & vitals). Several EHR variables act as strong correlates of whether the model assigns a patient to Normal LVEF.
• Presence of cardiomyopathy (ICD-10 I42.9) history contributes away from “Normal LVEF” (negative SHAP), consistent with cardiomyopathy being a high-risk substrate for reduced systolic function.
• Ischemic c...

[14] Normal LVEF. ECG voltage summaries and rhythm-related morphology. A large fraction of influential ECG predictors are voltage/amplitude summaries, especially in lateral or limb leads.
• Lead I QR-interval amplitude statistics (average and median): These are among the top ECG predictors for the Normal score. The beeswarm suggests that certain ranges of Lead I QR amplitude...

[15] Normal LVEF. Signal complexity / frequency-domain features.
• Spectral entropy (Lead V6): Frequency-domain complexity is among the top contributors for the Normal class. Higher entropy (a broader, more complex frequency distribution) tends to shift the “Normal LVEF” score in a consistent direction in this cohort. This likely reflects that certain abnormal morphologies ...
