HeartBeatAI: An Interpretable and Robust Deep Learning Framework for Multi-Label ECG Arrhythmia Detection
Pith reviewed 2026-06-30 13:56 UTC · model grok-4.3
The pith
HeartBeatAI reaches 98% Macro F1-score for multi-label ECG arrhythmia detection within datasets but degrades for rare anomalies under domain shift.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By integrating domain generalization methods with multi-scale feature extraction and explainability components, HeartBeatAI achieves a 98% Macro F1-score in intra-source evaluations on four ECG datasets for multi-label arrhythmia classification. Leave-One-Domain-Out (LODO) evaluations, however, show substantial degradation, particularly on infrequent anomalies, which underscores how difficult robust cross-institutional performance remains.
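Because the headline number is a Macro F1-score, it helps to see why that metric is sensitive to rare classes. The sketch below is a minimal illustration, not the paper's evaluation code: it assumes predictions have already been thresholded to 0/1, and the function names and toy labels are hypothetical.

```python
from typing import List, Sequence


def per_class_f1(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> List[float]:
    """Return one F1 score per label column of two binary multi-label matrices."""
    n_classes = len(y_true[0])
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 0)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no positives and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(y_true: Sequence[Sequence[int]],
             y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores) / len(scores)


# Hypothetical toy data: 4 recordings, 3 labels; the third label is rare
# and its single positive example is missed by the classifier.
y_true = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 0]]

print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Every class counts equally in the average, so one missed rare class removes a full third of the score in this toy case, even though only one of five positive labels was wrong.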
What carries the argument
A Squeeze-and-Excitation (SE) ResNet that isolates diagnostic leads, paired with a Multi-Layer Concentration Pipeline that captures both macro-rhythm and micro-morphological anomalies.
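The channel-gating idea behind a squeeze-and-excitation block can be sketched in a few lines. This is a generic SE gate in plain Python, not HeartBeatAI's implementation: the function name, the tiny weight matrices, and the two-channel input are assumptions chosen for illustration, and in a real network the channels would be learned feature maps rather than raw leads.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def se_recalibrate(x: Matrix, w1: Matrix, w2: Matrix) -> Tuple[Matrix, List[float]]:
    """Apply a squeeze-and-excitation gate to a (channels x time) signal.

    x  : C channels, each a list of T samples
    w1 : (C // r) x C bottleneck weights (r is the reduction ratio)
    w2 : C x (C // r) expansion weights
    """
    # Squeeze: global average pooling over time, one scalar per channel.
    z = [sum(ch) / len(ch) for ch in x]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    # Excitation, step 2: expansion layer followed by a sigmoid gate in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every sample of a channel by that channel's gate.
    scaled = [[g * s for s in ch] for g, ch in zip(gates, x)]
    return scaled, gates


# Hypothetical input: 2 channels, 4 samples each, reduction ratio r = 2.
x = [[1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 0.0, 2.0]]
w1 = [[1.0, -0.5]]       # 1 x 2, made-up weights
w2 = [[2.0], [-2.0]]     # 2 x 1, made-up weights

scaled, gates = se_recalibrate(x, w1, w2)
print(gates)
```

With these made-up weights the first channel receives the larger gate, which is the mechanism by which an SE block can emphasize some channels over others.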
If this is right
- The framework reliably handles simultaneous multi-label arrhythmia classification when data distributions match between training and test sets.
- MixStyle regularization and label smoothing reduce but do not eliminate degradation on rare classes during domain-shift tests.
- Inclusion of clinical explainability components supports potential use in medical settings.
- LODO results indicate that further advances are needed before reliable deployment across different recording sites.
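The Leave-One-Domain-Out protocol referred to above holds out one source dataset entirely for testing and trains on the remaining ones, repeating this once per dataset. A minimal sketch, assuming each domain is simply a list of record identifiers; the site names and IDs are placeholders, not the paper's datasets.

```python
from typing import Dict, Iterator, List, Tuple


def lodo_splits(
    domains: Dict[str, List[str]]
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_domain, train_records, test_records) once per domain."""
    for held_out in sorted(domains):
        # Train on every record from all other domains.
        train = [rec
                 for name in sorted(domains) if name != held_out
                 for rec in domains[name]]
        # Test only on the domain the model has never seen.
        test = list(domains[held_out])
        yield held_out, train, test


# Hypothetical record IDs for four recording sites.
domains = {
    "site_a": ["a1", "a2"],
    "site_b": ["b1"],
    "site_c": ["c1", "c2", "c3"],
    "site_d": ["d1"],
}

for held_out, train, test in lodo_splits(domains):
    print(held_out, len(train), len(test))
```

An intra-source evaluation, by contrast, would draw train and test records from the same site, which is why the two protocols can give very different scores.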
Where Pith is reading between the lines
- Training on ECG recordings drawn from a broader set of institutions could narrow the performance gap seen in LODO tests.
- The lead-isolation and multi-scale pipeline could transfer to classification tasks on other time-series biosignals that face similar distribution shifts.
- Detailed per-anomaly breakdowns from the LODO runs could reveal which specific rare classes drive most of the cross-domain loss.
Load-bearing premise
That the four datasets and the LODO protocol sufficiently capture real-world domain shifts between institutions and that observed performance drops stem mainly from those shifts rather than label noise or other factors.
What would settle it
Running the same framework on new ECG collections from additional institutions and finding no drop in Macro F1-score for rare anomalies would challenge the claim that domain shift creates persistent cross-institutional challenges.
Original abstract
While Deep Learning (DL) enhances automated electrocardiogram (ECG) analysis, clinical deployment is hindered by class imbalance and the generalization gap. This paper presents HeartBeatAI, a deep learning framework combining domain generalization, multi-scale feature aggregation, and clinical explainability for robust 12-lead ECG classification. Moving beyond image-based paradigms, HeartBeatAI integrates a Squeeze-and-Excitation (SE) ResNet to isolate diagnostic leads alongside a Multi-Layer Concentration Pipeline to capture macro-rhythm and micro-morphological anomalies. To mitigate domain shift, the framework employs MixStyle regularization and Label Smoothing. Rigorous benchmarking across four large-scale datasets using intra-source and Leave-One-Domain-Out (LODO) protocols demonstrates high performance (98% Macro F1-score) under intra-source conditions. However, LODO evaluations reveal significant degradation in detecting rare anomalies, highlighting a persistent challenge in cross-institutional deployment.
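The two regularizers named in the abstract can be illustrated generically. The sketch below assumes the standard MixStyle idea (mixing per-channel feature mean and standard deviation between instances of a batch) and a common binary form of label smoothing; the abstract does not give the paper's exact formulation or hyperparameters, so `alpha`, `eps`, the single batch-level mixing weight, and all function names are assumptions.

```python
import math
import random
from typing import List, Optional, Tuple

Instance = List[List[float]]  # channels x time


def channel_stats(ch: List[float], eps: float = 1e-6) -> Tuple[float, float]:
    """Mean and (epsilon-stabilized) standard deviation of one channel."""
    mu = sum(ch) / len(ch)
    var = sum((v - mu) ** 2 for v in ch) / len(ch)
    return mu, math.sqrt(var + eps)


def mixstyle(batch: List[Instance], alpha: float = 0.1,
             lam: Optional[float] = None,
             perm: Optional[List[int]] = None) -> List[Instance]:
    """Mix each instance's channel statistics with those of a shuffled partner."""
    if perm is None:
        perm = list(range(len(batch)))
        random.shuffle(perm)
    if lam is None:
        # Simplification: one mixing weight for the whole batch.
        lam = random.betavariate(alpha, alpha)
    out = []
    for i, inst in enumerate(batch):
        mixed = []
        for c, ch in enumerate(inst):
            mu, sd = channel_stats(ch)
            mu2, sd2 = channel_stats(batch[perm[i]][c])
            mu_mix = lam * mu + (1.0 - lam) * mu2
            sd_mix = lam * sd + (1.0 - lam) * sd2
            # Normalize with own statistics, then re-style with the mixed ones.
            mixed.append([sd_mix * (v - mu) / sd + mu_mix for v in ch])
        out.append(mixed)
    return out


def smooth_labels(y: List[int], eps: float = 0.1) -> List[float]:
    """Binary label smoothing: pull each 0/1 target toward 0.5 by eps."""
    return [t * (1.0 - eps) + 0.5 * eps for t in y]


# Hypothetical batch: two single-channel instances with different scales.
batch = [[[0.0, 2.0]], [[10.0, 14.0]]]
print(mixstyle(batch, lam=0.5, perm=[1, 0]))
print(smooth_labels([1, 0, 1]))
```

Passing `lam` and `perm` explicitly makes the behavior reproducible; leaving them unset samples a random partner and a Beta-distributed weight, which is how such a layer would typically be used during training.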
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HeartBeatAI, a deep learning framework for multi-label 12-lead ECG arrhythmia detection. It combines a SE-ResNet with multi-scale feature aggregation via a Multi-Layer Concentration Pipeline, incorporates MixStyle regularization and Label Smoothing for domain generalization, and aims for clinical explainability. Benchmarking on four large-scale datasets shows 98% Macro F1-score in intra-source settings but significant degradation in Leave-One-Domain-Out (LODO) evaluations, particularly for rare anomalies, underscoring challenges in cross-institutional deployment.
Significance. If the empirical results can be verified with full methodological details, the framework could advance robust and interpretable ECG analysis for clinical use by addressing domain shift and class imbalance. However, the current presentation lacks the necessary details to assess its contribution relative to existing methods.
major comments (2)
- Abstract: The reported 98% Macro F1-score under intra-source conditions is presented without any baselines, statistical tests, implementation details, or error analysis, rendering the central performance claim unverifiable.
- LODO protocol description: The attribution of performance degradation in LODO to domain shift between institutions lacks supporting information on dataset sizes, per-class frequencies, annotation protocols, or acquisition parameters; without these, alternative explanations such as label noise or varying class imbalance cannot be ruled out.
minor comments (1)
- Abstract: The term 'large-scale datasets' is used without specifying the actual dataset names or sizes.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point-by-point below and indicate the revisions we will make.
- Referee: Abstract: The reported 98% Macro F1-score under intra-source conditions is presented without any baselines, statistical tests, implementation details, or error analysis, rendering the central performance claim unverifiable.
  Authors: We agree that the abstract's brevity limits inclusion of supporting elements. The full manuscript (Sections 3 and 4) provides the requested details: comparisons to multiple baselines, paired statistical tests, implementation hyperparameters, and per-class error breakdowns. To improve verifiability of the headline claim, we will revise the abstract to briefly note the intra-source benchmarking protocol and to reference the state-of-the-art comparisons.
  Revision: yes
- Referee: LODO protocol description: The attribution of performance degradation in LODO to domain shift between institutions lacks supporting information on dataset sizes, per-class frequencies, annotation protocols, or acquisition parameters; without these, alternative explanations such as label noise or varying class imbalance cannot be ruled out.
  Authors: We concur that expanded dataset characterization is needed to strengthen the domain-shift interpretation. The current manuscript references the four public datasets and their source publications but does not tabulate the requested statistics in the LODO section. We will add an explicit table (or expanded subsection) listing dataset sizes, per-class frequencies across domains, annotation sources, and acquisition parameters to allow readers to evaluate alternative explanations such as label noise or imbalance differences.
  Revision: yes
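The per-class frequency comparison discussed in this exchange is straightforward to compute once each dataset's labels are available as binary matrices. The following is a minimal sketch under that assumption; the domain names, class names, and label rows are invented placeholders and do not reflect the paper's data.

```python
from typing import Dict, List


def class_prevalence(labels_by_domain: Dict[str, List[List[int]]],
                     class_names: List[str]) -> Dict[str, Dict[str, float]]:
    """Fraction of recordings carrying each label, computed per domain."""
    table: Dict[str, Dict[str, float]] = {}
    for domain, rows in labels_by_domain.items():
        n = len(rows)
        table[domain] = {
            name: (sum(r[j] for r in rows) / n if n else 0.0)
            for j, name in enumerate(class_names)
        }
    return table


# Hypothetical labels: rows are recordings, columns follow class_names.
class_names = ["class_common", "class_rare"]
labels_by_domain = {
    "domain_1": [[1, 0], [1, 0], [0, 1], [1, 0]],
    "domain_2": [[1, 0], [0, 0], [1, 0], [0, 0]],
}

prevalence = class_prevalence(labels_by_domain, class_names)
for domain, row in prevalence.items():
    print(domain, row)
```

A class that is present in the training domains but nearly absent in the held-out one (or the reverse) would show up directly in such a table, which is one way to separate imbalance differences from genuine domain shift.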
Circularity Check
No circularity: purely empirical benchmarking with no derivations or self-referential reductions
full rationale
The paper reports direct experimental outcomes from training and evaluating a DL model (SE-ResNet + Multi-Layer Concentration Pipeline + MixStyle + Label Smoothing) on four datasets under intra-source and LODO protocols, yielding measured metrics such as 98% Macro F1. No equations, parameter-fitting steps presented as predictions, uniqueness theorems, or self-citation chains appear in the abstract or described content. All claims are falsifiable performance statements grounded in external data splits rather than reducing to inputs by construction. The absence of any derivation chain makes circularity analysis inapplicable; the reader's assigned score of 2 reflects this lack of mathematical structure.