
arxiv: 2606.22889 · v1 · pith:Q2AI5YYZnew · submitted 2026-06-22 · 💻 cs.LG

Physiology-Aware CNN and Zero-Shot Multimodal LLMs for ECG Image Classification: A Comparative Study

Pith reviewed 2026-06-26 09:25 UTC · model grok-4.3

classification 💻 cs.LG
keywords ECG image classification · multimodal LLMs · zero-shot learning · physiology-aware CNN · lead-group aggregation · PTB-XL dataset · ROC-AUC evaluation

The pith

Zero-shot multimodal LLMs discriminate normal from abnormal ECG images at chance level (ROC-AUC around 0.5), while CNN-based models, including a physiology-aware one, reach 0.92-0.94 internal ROC-AUC.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether zero-shot multimodal LLMs can classify 12-lead ECG images as normal or abnormal. It finds that three leading models (GPT-5.2, GPT-4.1, Gemini-2.5 Pro) perform near random chance. On the same task, CNN-based models, including a custom CNN that groups leads by anatomical region, discriminate well on the internal test set (ROC-AUC 0.92-0.94) and somewhat less well on external data (0.85-0.86), with stable results across seeds. The authors conclude that clinically framed, domain-specific architectures remain necessary for reliable ECG interpretation.

Core claim

Standard 12-lead ECG images were presented to GPT-5.2, GPT-4.1 and Gemini-2.5 Pro under a fixed zero-shot prompt; all three models produced ROC-AUC values around 0.5. In parallel, the CNN-based models, averaged across random seeds, reached internal ROC-AUC of 0.92-0.94 and external ROC-AUC of 0.85-0.86 on PTB-XL. The physiology-aware CNN (LeadGroupECG), which aggregates features from predefined anatomical lead groups, outperformed its backbone internally without compromising external generalization, remained competitive with ResNet18, DenseNet121 and VGG16, and consistently highlighted anatomical lead-group contributions.

What carries the argument

LeadGroupECG model that aggregates convolutional features from predefined anatomical lead groups before final classification.
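
As a rough illustration of what lead-group aggregation could look like, the sketch below mean-pools per-lead feature vectors within each group and concatenates the pooled vectors before a logistic output. It is not the authors' architecture: the group definitions (conventional inferior, lateral, septal and anterior territories), the mean pooling, and the names `LEAD_GROUPS`, `aggregate_by_group` and `predict_abnormal` are all assumptions made for this example, since the abstract does not specify them.

```python
import math

# ASSUMPTION: conventional textbook lead territories. The paper's actual
# "predefined anatomical lead groups" are not listed in the abstract.
LEAD_GROUPS = {
    "inferior": ["II", "III", "aVF"],
    "lateral": ["I", "aVL", "V5", "V6"],
    "septal": ["V1", "V2"],
    "anterior": ["V3", "V4"],
}


def aggregate_by_group(lead_features):
    """Mean-pool per-lead feature vectors within each group, then concatenate.

    lead_features maps a lead name to a list of floats (all the same length).
    Leads outside LEAD_GROUPS (here aVR) are ignored.
    """
    pooled = []
    for leads in LEAD_GROUPS.values():
        vectors = [lead_features[lead] for lead in leads]
        dim = len(vectors[0])
        pooled.extend(sum(v[i] for v in vectors) / len(vectors) for i in range(dim))
    return pooled


def predict_abnormal(lead_features, weights, bias=0.0):
    """Hypothetical logistic head on top of the concatenated group features."""
    z = bias + sum(w * x for w, x in zip(weights, aggregate_by_group(lead_features)))
    return 1.0 / (1.0 + math.exp(-z))


# Toy input: a 2-dimensional feature vector for each of the 12 leads.
ALL_LEADS = ["I", "II", "III", "aVR", "aVL", "aVF", "V1", "V2", "V3", "V4", "V5", "V6"]
features = {lead: [1.0, 2.0] for lead in ALL_LEADS}
group_vector = aggregate_by_group(features)  # 4 groups x 2 features = 8 values
```

In a real CNN the per-lead vectors would come from convolutional feature maps cropped to each lead's region of the image, and the group-level vectors are what an interpretability readout of "lead-group contributions" would be computed from.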

If this is right

  • CNN discrimination is stable across random seeds on both the internal test set and external PTB-XL, while LLM discrimination stays near chance.
  • Anatomical lead-group aggregation improves internal discrimination without harming external generalization.
  • Grid-based calibration backgrounds yield modest PR-AUC gains over grid-free images for the LLMs.
  • Multimodal LLMs can still produce narrative descriptions even when their binary discrimination fails.
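
Reading the PR-AUC point requires the class balance, because a classifier with no skill scores a PR-AUC close to the fraction of abnormal cases rather than 0.5. The sketch below computes average precision (a common step-wise estimate of PR-AUC) in plain Python next to that prevalence baseline; the function names and the toy data are hypothetical, and ties between scores are not treated specially.

```python
def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example.

    Assumes labels are 0/1 and that at least one positive is present.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


def prevalence(labels):
    """Fraction of positives: the PR-AUC expected from a no-skill classifier."""
    return sum(labels) / len(labels)


# Toy data (hypothetical): 1 = abnormal, 0 = normal.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
baseline = prevalence(labels)
```

A "modest PR-AUC gain" is only interpretable relative to `baseline`, which is why the referee's request for class balance matters for this finding.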

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the chance-level LLM result holds under varied prompts, hospitals would need separate domain-specific models rather than relying on general-purpose LLMs for ECG triage.
  • The lead-group mechanism could be tested on finer-grained tasks such as arrhythmia subtyping to check whether anatomical grouping scales beyond binary normal-abnormal decisions.

Load-bearing premise

The internal test set and PTB-XL are representative of real clinical ECG distributions and the single fixed zero-shot prompt fairly tests LLM capability without hidden artifacts.

What would settle it

A replication that applies varied prompts, multi-page ECG layouts or a larger external ECG corpus and measures whether any LLM configuration exceeds 0.6 ROC-AUC on the same normal-abnormal task.
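
The 0.6 threshold in such a replication can be checked with a rank-based ROC-AUC, which is the probability that a randomly chosen abnormal ECG receives a higher score than a randomly chosen normal one. The sketch below is a plain-Python version with ties counted as half; the function name, the toy labels, and the use of hard 0/1 answers as scores are assumptions for illustration, not the paper's evaluation code.

```python
def roc_auc(labels, scores):
    """Rank-based ROC-AUC: fraction of (positive, negative) pairs ranked
    correctly, with tied pairs counted as half. Assumes both classes occur."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = abnormal, 0 = normal.
labels = [1, 1, 0, 0]
hard_answers = [1, 0, 0, 0]  # a model that only returns a yes/no label
auc = roc_auc(labels, hard_answers)
exceeds_threshold = auc > 0.6
```

When a model returns only a hard label, this quantity reduces to the mean of sensitivity and specificity, so an LLM that answers "abnormal" for nearly every image lands at about 0.5 regardless of prompt wording.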

Original abstract

Multimodal large language models (LLMs) are increasingly adopted to interpret 12-lead ECG images, though the interpretations often lack validation. However, ECG image understanding significantly differs from general images as it depends on precise waveform morphology, lead relationships and accurate interval measurements. This study investigated whether zero-shot multimodal LLMs can reliably distinguish normal and abnormal ECG images and, in parallel, evaluated CNN-based models for clinically grounded references. Standard 12-lead ECG recordings were rendered as single-page images for a binary normal-abnormal classification task. Three prominent LLMs (GPT-5.2, GPT-4.1, and Gemini-2.5 Pro) were tested using a fixed zero-shot prompt across multiple runs. In parallel, a physiology-aware CNN-based model was developed with the capability to aggregate features from the predefined anatomical lead groups. The model was compared with ResNet18, DenseNet121, VGG16 baselines, and all the models were evaluated on an internal test set and external PTB-XL dataset. Across seeds, CNN-based models demonstrated stable discrimination, with average internal ROC-AUC of 0.92-0.94, and external ROC-AUC of 0.85-0.86. The proposed LeadGroupECG model significantly improved over its backbone internally without compromising external generalization. It remained competitive with other baselines, while consistently highlighting anatomical lead-group contributions. In contrast, zero-shot LLM discrimination remained near-chance (ROC-AUC around 0.5). The PR-AUC improved slightly when ECGs used a grid-based calibration background compared with the grid-free ECGs. Although multimodal LLMs can generate reasonable ECG narratives, their zero-shot diagnostic discrimination remains limited. Therefore, clinically framed, domain-specific architectures remain essential for AI-based ECG interpretation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper compares a proposed physiology-aware CNN (LeadGroupECG) that aggregates features from predefined anatomical lead groups against standard CNN baselines (ResNet18, DenseNet121, VGG16) and three zero-shot multimodal LLMs (GPT-5.2, GPT-4.1, Gemini-2.5 Pro) on binary normal/abnormal classification of single-page rendered 12-lead ECG images. CNN models achieve stable high performance (internal ROC-AUC 0.92-0.94, external PTB-XL 0.85-0.86) with the proposed model improving internally without harming generalization, while LLMs remain near chance (ROC-AUC ~0.5); the conclusion is that domain-specific architectures remain essential.

Significance. If the empirical results hold after supplying missing details, the work provides concrete evidence that zero-shot multimodal LLMs currently lack reliable discrimination on ECG waveform morphology and interval tasks despite narrative generation ability, reinforcing the value of clinically framed CNNs with lead-group structure for medical signal interpretation and offering both internal and external validation.

major comments (3)
  1. [Abstract] Abstract and Methods: The reported ROC-AUC (0.92-0.94 internal, 0.85-0.86 external) and PR-AUC values for both CNNs and LLMs are supplied without dataset sizes, class balance, total sample counts, or any statistical tests, which are load-bearing for assessing whether the performance gap and the claim of LLM near-chance discrimination are reliable.
  2. [Methods] Methods (LLM evaluation paragraph): The fixed zero-shot prompt text is not reproduced and image rendering details (resolution, single-page layout parameters, grid vs. grid-free specifics) are omitted; this directly affects evaluation of the skeptic concern that the ~0.5 ROC-AUC may partly reflect prompt or rendering artifacts rather than inherent limitation, undermining the generalization to 'clinically framed, domain-specific architectures remain essential'.
  3. [Results] Results: The statement that LeadGroupECG 'significantly improved over its backbone internally' lacks reported p-values, confidence intervals, or effect sizes, and the external generalization claim cannot be fully evaluated without sample sizes or per-class metrics.
minor comments (1)
  1. [Abstract] Clarify the exact versions or access dates for the cited LLMs (GPT-5.2, GPT-4.1) as these labels are non-standard.
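
One way to supply the uncertainty estimates requested above is a percentile bootstrap over the test set. The sketch below resamples positives and negatives separately (so every resample contains both classes) and reports a 95% interval for ROC-AUC; the function names, the number of resamples, and the seed are illustrative assumptions, and the paper may use a different procedure.

```python
import random


def roc_auc(labels, scores):
    """Rank-based ROC-AUC with ties counted as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=200, alpha=0.05, seed=0):
    """Stratified percentile-bootstrap confidence interval for ROC-AUC."""
    rng = random.Random(seed)
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    aucs = []
    for _ in range(n_boot):
        p = rng.choices(pos, k=len(pos))  # resample positives with replacement
        n = rng.choices(neg, k=len(neg))  # resample negatives with replacement
        aucs.append(roc_auc([1] * len(p) + [0] * len(n), p + n))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy data (hypothetical): noisy scores that are slightly higher for positives.
toy_rng = random.Random(1)
labels = [0] * 20 + [1] * 20
scores = [toy_rng.random() + 0.3 * y for y in labels]
ci_low, ci_high = bootstrap_auc_ci(labels, scores)
```

The same resampling could be applied to the paired difference between two models' AUCs on identical resamples, which is one possible basis for a claim such as "significantly improved over its backbone".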

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the missing details limit full evaluation of the results and have prepared revisions to supply them. Point-by-point responses are provided below.

  1. Referee: [Abstract] Abstract and Methods: The reported ROC-AUC (0.92-0.94 internal, 0.85-0.86 external) and PR-AUC values for both CNNs and LLMs are supplied without dataset sizes, class balance, total sample counts, or any statistical tests, which are load-bearing for assessing whether the performance gap and the claim of LLM near-chance discrimination are reliable.

    Authors: We agree these details are necessary. The revised manuscript will report internal and external dataset sizes, class balances, total sample counts, and statistical tests (including p-values and 95% confidence intervals) for all AUC and PR-AUC values to support the performance gap claims.

    revision: yes

  2. Referee: [Methods] Methods (LLM evaluation paragraph): The fixed zero-shot prompt text is not reproduced and image rendering details (resolution, single-page layout parameters, grid vs. grid-free specifics) are omitted; this directly affects evaluation of the skeptic concern that the ~0.5 ROC-AUC may partly reflect prompt or rendering artifacts rather than inherent limitation, undermining the generalization to 'clinically framed, domain-specific architectures remain essential'.

    Authors: The exact zero-shot prompt will be reproduced in the Methods. Image rendering parameters (resolution, single-page layout, grid vs. grid-free details) will be added to allow assessment of potential artifacts and strengthen the generalization argument.

    revision: yes

  3. Referee: [Results] Results: The statement that LeadGroupECG 'significantly improved over its backbone internally' lacks reported p-values, confidence intervals, or effect sizes, and the external generalization claim cannot be fully evaluated without sample sizes or per-class metrics.

    Authors: We will add p-values, confidence intervals, and effect sizes for the internal improvement. External PTB-XL results will include sample sizes and per-class metrics (e.g., sensitivity, specificity) to fully support the generalization claims.

    revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with measured metrics


The paper reports experimental results from training and evaluating CNN models (including a proposed LeadGroupECG variant) and testing zero-shot multimodal LLMs on binary ECG classification, using internal held-out data and external PTB-XL validation. All claims rest on observed ROC-AUC, PR-AUC and related performance numbers rather than any derivation, equation, fitted parameter renamed as prediction, or self-citation chain. No load-bearing step reduces to its own inputs by construction; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard machine learning evaluation assumptions and the representativeness of the chosen datasets; no free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)
  • domain assumption: ROC-AUC and PR-AUC are suitable metrics for assessing binary ECG image classification performance.
    Common practice in medical image classification tasks.

pith-pipeline@v0.9.1-grok · 5866 in / 1490 out tokens · 29791 ms · 2026-06-26T09:25:20.357236+00:00 · methodology

