Physiology-Aware CNN and Zero-Shot Multimodal LLMs for ECG Image Classification: A Comparative Study
Pith reviewed 2026-06-26 09:25 UTC · model grok-4.3
The pith
Zero-shot multimodal LLMs stay at chance-level discrimination (ROC-AUC around 0.5) on normal-versus-abnormal ECG images, while CNN-based models reach an internal ROC-AUC of 0.92-0.94.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Standard 12-lead ECG recordings were rendered as single-page images and presented to GPT-5.2, GPT-4.1 and Gemini-2.5 Pro under a fixed zero-shot prompt across multiple runs; all three models produced ROC-AUC values around 0.5 on the binary normal-abnormal task. In parallel, a physiology-aware CNN (LeadGroupECG) that aggregates features from predefined anatomical lead groups was compared with ResNet18, DenseNet121 and VGG16 baselines. Across seeds, the CNN-based models reached an average ROC-AUC of 0.92-0.94 on the internal test set and 0.85-0.86 on the external PTB-XL dataset. LeadGroupECG outperformed its backbone internally without losing external generalization, remained competitive with the other baselines, and consistently highlighted anatomical lead-group contributions.
What carries the argument
LeadGroupECG model that aggregates convolutional features from predefined anatomical lead groups before final classification.
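A minimal sketch of what lead-group aggregation could look like, assuming per-lead feature vectors are already available (for example, from a CNN backbone applied to each lead's region of the image). The group definitions follow a common textbook grouping and are an assumption; the paper's actual groups, pooling operation and classifier head are not given in this summary. All names (`LEAD_GROUPS`, `aggregate_by_group`, `classify`) are hypothetical.

```python
import math

# Assumed grouping (common textbook convention), not the paper's definition.
LEAD_GROUPS = {
    "inferior": ["II", "III", "aVF"],
    "lateral": ["I", "aVL", "V5", "V6"],
    "septal": ["V1", "V2"],
    "anterior": ["V3", "V4"],
    "aVR": ["aVR"],
}


def mean_pool(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def aggregate_by_group(lead_features):
    """Map {lead name: feature vector} to {group name: pooled feature vector}."""
    return {
        group: mean_pool([lead_features[lead] for lead in leads])
        for group, leads in LEAD_GROUPS.items()
    }


def classify(lead_features, group_weights, bias=0.0):
    """Linear head over pooled group features.

    Returns the abnormal-class probability and each group's contribution
    to the logit, which is one simple way to expose lead-group importance.
    """
    pooled = aggregate_by_group(lead_features)
    contributions = {
        group: sum(w * x for w, x in zip(group_weights[group], pooled[group]))
        for group in pooled
    }
    logit = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, contributions
```

Because the logit is a plain sum over groups, the per-group contributions add up exactly to the decision score; a learned attention or gating layer would be a natural alternative, but whether the paper uses one is not stated here.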
If this is right
- CNN discrimination is stable across random seeds and degrades only moderately from the internal test set (ROC-AUC 0.92-0.94) to external PTB-XL (0.85-0.86), while LLM performance stays near chance.
- Anatomical lead-group aggregation improves internal discrimination without harming external generalization.
- Grid-based calibration backgrounds yield modest PR-AUC gains over grid-free images for the LLMs.
- Multimodal LLMs can still produce narrative descriptions even when their binary discrimination fails.
Where Pith is reading between the lines
- If the chance-level LLM result holds under varied prompts, hospitals would need separate domain-specific models rather than relying on general-purpose LLMs for ECG triage.
- The lead-group mechanism could be tested on finer-grained tasks such as arrhythmia subtyping to check whether anatomical grouping scales beyond binary normal-abnormal decisions.
Load-bearing premise
The internal test set and PTB-XL are representative of real clinical ECG distributions and the single fixed zero-shot prompt fairly tests LLM capability without hidden artifacts.
What would settle it
A replication that applies varied prompts, multi-page ECG layouts or a larger external ECG corpus and measures whether any LLM configuration exceeds 0.6 ROC-AUC on the same normal-abnormal task.
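Whether an observed ROC-AUC "exceeds 0.6" depends on how many normal and abnormal ECGs were scored. The sketch below uses the Hanley-McNeil (1982) approximation for the standard error of an AUC and a one-sided normal-approximation lower bound. The sample sizes in the usage note are hypothetical, since the summary reports none, and the function names are illustrative.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an ROC-AUC (Hanley and McNeil, 1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def lower_bound(auc, n_pos, n_neg, z=1.645):
    """One-sided 95% lower confidence bound under a normal approximation."""
    return auc - z * hanley_mcneil_se(auc, n_pos, n_neg)


def clearly_exceeds(auc, n_pos, n_neg, threshold=0.6):
    """True if the lower bound, not just the point estimate, clears the threshold."""
    return lower_bound(auc, n_pos, n_neg) > threshold
```

For example, an observed AUC of 0.62 on 50 abnormal and 50 normal images does not clear 0.6 under this check, while the same value on 5000 of each does. A bootstrap interval would be a reasonable alternative when per-image scores are available.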
Original abstract
Multimodal large language models (LLMs) are increasingly adopted to interpret 12-lead ECG images, though the interpretations often lack validation. However, ECG image understanding significantly differs from general images as it depends on precise waveform morphology, lead relationships and accurate interval measurements. This study investigated whether zero-shot multimodal LLMs can reliably distinguish normal and abnormal ECG images and, in parallel, evaluated CNN-based models for clinically grounded references. Standard 12-lead ECG recordings were rendered as single-page images for a binary normal-abnormal classification task. Three prominent LLMs (GPT-5.2, GPT-4.1, and Gemini-2.5 Pro) were tested using a fixed zero-shot prompt across multiple runs. In parallel, a physiology-aware CNN-based model was developed with the capability to aggregate features from the predefined anatomical lead groups. The model was compared with ResNet18, DenseNet121, VGG16 baselines, and all the models were evaluated on an internal test set and external PTB-XL dataset. Across seeds, CNN-based models demonstrated stable discrimination, with average internal ROC-AUC of 0.92-0.94, and external ROC-AUC of 0.85-0.86. The proposed LeadGroupECG model significantly improved over its backbone internally without compromising external generalization. It remained competitive with other baselines, while consistently highlighting anatomical lead-group contributions. In contrast, zero-shot LLM discrimination remained near-chance (ROC-AUC around 0.5). The PR-AUC improved slightly when ECGs used a grid-based calibration background compared with the grid-free ECGs. Although multimodal LLMs can generate reasonable ECG narratives, their zero-shot diagnostic discrimination remains limited. Therefore, clinically framed, domain-specific architectures remain essential for AI-based ECG interpretation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper compares a proposed physiology-aware CNN (LeadGroupECG) that aggregates features from predefined anatomical lead groups against standard CNN baselines (ResNet18, DenseNet121, VGG16) and three zero-shot multimodal LLMs (GPT-5.2, GPT-4.1, Gemini-2.5 Pro) on binary normal/abnormal classification of single-page rendered 12-lead ECG images. CNN models achieve stable high performance (internal ROC-AUC 0.92-0.94, external PTB-XL 0.85-0.86) with the proposed model improving internally without harming generalization, while LLMs remain near chance (ROC-AUC ~0.5); the conclusion is that domain-specific architectures remain essential.
Significance. If the empirical results hold after supplying missing details, the work provides concrete evidence that zero-shot multimodal LLMs currently lack reliable discrimination on ECG waveform morphology and interval tasks despite narrative generation ability, reinforcing the value of clinically framed CNNs with lead-group structure for medical signal interpretation and offering both internal and external validation.
major comments (3)
- [Abstract] Abstract and Methods: The reported ROC-AUC (0.92-0.94 internal, 0.85-0.86 external) and PR-AUC values for both CNNs and LLMs are supplied without dataset sizes, class balance, total sample counts, or any statistical tests, which are load-bearing for assessing whether the performance gap and the claim of LLM near-chance discrimination are reliable.
- [Methods] Methods (LLM evaluation paragraph): The fixed zero-shot prompt text is not reproduced and image rendering details (resolution, single-page layout parameters, grid vs. grid-free specifics) are omitted; this directly affects evaluation of the skeptic concern that the ~0.5 ROC-AUC may partly reflect prompt or rendering artifacts rather than inherent limitation, undermining the generalization to 'clinically framed, domain-specific architectures remain essential'.
- [Results] Results: The statement that LeadGroupECG 'significantly improved over its backbone internally' lacks reported p-values, confidence intervals, or effect sizes, and the external generalization claim cannot be fully evaluated without sample sizes or per-class metrics.
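One way a "significantly improved" claim between two models scored on the same test set could be supported is a paired bootstrap on the ROC-AUC difference, which yields an effect size, a confidence interval and an approximate p-value together. This is a sketch under that assumption, not the authors' procedure; `paired_bootstrap_auc_diff` and its parameters are hypothetical names.

```python
import random


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_auc_diff(labels, scores_a, scores_b, n_boot=2000, seed=0):
    """Return (observed AUC difference, 95% percentile CI, two-sided p-value).

    The same resampled indices are used for both models, so the comparison
    stays paired on identical test cases.
    """
    rng = random.Random(seed)
    n = len(labels)
    observed = roc_auc(labels, scores_a) - roc_auc(labels, scores_b)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain a single class
        auc_a = roc_auc(ys, [scores_a[i] for i in idx])
        auc_b = roc_auc(ys, [scores_b[i] for i in idx])
        diffs.append(auc_a - auc_b)
    diffs.sort()
    ci = (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1])
    # Approximate two-sided p-value: how often the difference falls on either side of zero.
    frac_le = sum(d <= 0 for d in diffs) / n_boot
    frac_ge = sum(d >= 0 for d in diffs) / n_boot
    p_value = min(1.0, 2.0 * min(frac_le, frac_ge))
    return observed, ci, p_value
```

The pairwise AUC computation is quadratic in the number of cases, which is fine for a sketch; a rank-based implementation or DeLong's test would scale better on large test sets.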
minor comments (1)
- [Abstract] Clarify the exact versions or access dates for the cited LLMs (GPT-5.2, GPT-4.1) as these labels are non-standard.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the missing details limit full evaluation of the results and have prepared revisions to supply them. Point-by-point responses are provided below.
- Referee: [Abstract] Abstract and Methods: The reported ROC-AUC (0.92-0.94 internal, 0.85-0.86 external) and PR-AUC values for both CNNs and LLMs are supplied without dataset sizes, class balance, total sample counts, or any statistical tests, which are load-bearing for assessing whether the performance gap and the claim of LLM near-chance discrimination are reliable.
  Authors: We agree these details are necessary. The revised manuscript will report internal and external dataset sizes, class balances, total sample counts, and statistical tests (including p-values and 95% confidence intervals) for all AUC and PR-AUC values to support the performance gap claims.
  Revision: yes
- Referee: [Methods] Methods (LLM evaluation paragraph): The fixed zero-shot prompt text is not reproduced and image rendering details (resolution, single-page layout parameters, grid vs. grid-free specifics) are omitted; this directly affects evaluation of the skeptic concern that the ~0.5 ROC-AUC may partly reflect prompt or rendering artifacts rather than inherent limitation, undermining the generalization to 'clinically framed, domain-specific architectures remain essential'.
  Authors: The exact zero-shot prompt will be reproduced in the Methods. Image rendering parameters (resolution, single-page layout, grid vs. grid-free details) will be added to allow assessment of potential artifacts and strengthen the generalization argument.
  Revision: yes
- Referee: [Results] Results: The statement that LeadGroupECG 'significantly improved over its backbone internally' lacks reported p-values, confidence intervals, or effect sizes, and the external generalization claim cannot be fully evaluated without sample sizes or per-class metrics.
  Authors: We will add p-values, confidence intervals, and effect sizes for the internal improvement. External PTB-XL results will include sample sizes and per-class metrics (e.g., sensitivity, specificity) to fully support the generalization claims.
  Revision: yes
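The per-class metrics mentioned in the rebuttal (sensitivity, specificity) could be reported with Wilson score intervals, which behave better than the plain normal approximation on small classes. The following is a sketch under that assumption, with binary predictions taken as given; the function names are illustrative, and the paper's decision threshold is not stated in this summary.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if total == 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def sensitivity_specificity(labels, preds):
    """Return {metric: (point estimate, (lower, upper))} with 1 = abnormal, 0 = normal."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    return {
        "sensitivity": (tp / (tp + fn), wilson_interval(tp, tp + fn)),
        "specificity": (tn / (tn + fp), wilson_interval(tn, tn + fp)),
    }
```

With only ten cases per class, a sensitivity of 0.8 carries an interval of roughly 0.49 to 0.94, which illustrates why the referee asks for sample sizes alongside the point estimates.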
Circularity Check
No circularity: purely empirical comparison with measured metrics
The paper reports experimental results from training and evaluating CNN models (including a proposed LeadGroupECG variant) and testing zero-shot multimodal LLMs on binary ECG classification, using internal held-out data and external PTB-XL validation. All claims rest on observed ROC-AUC, PR-AUC and related performance numbers rather than any derivation, equation, fitted parameter renamed as prediction, or self-citation chain. No load-bearing step reduces to its own inputs by construction; the work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: ROC-AUC and PR-AUC are suitable metrics for assessing binary ECG image classification performance.
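A point worth keeping in mind when relying on this assumption: ROC-AUC has a fixed chance level of 0.5, whereas the chance level of PR-AUC is roughly the prevalence of the positive class, so PR-AUC values cannot be interpreted without the class balance. The sketch below computes a step-wise PR-AUC (average precision) next to that baseline; it is illustrative only and the function names are hypothetical.

```python
def average_precision(labels, scores):
    """Step-wise PR-AUC: mean precision at the rank of each positive case.

    Cases are ranked by descending score; tied scores keep their input order.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive case is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def chance_pr_auc(labels):
    """Approximate PR-AUC of an uninformative scorer: the positive-class prevalence."""
    return sum(labels) / len(labels)
```

Reporting `average_precision` together with `chance_pr_auc` makes it visible whether a modest PR-AUC gain, such as the one reported for grid-based backgrounds, is a real lift over the baseline for that dataset.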