
arxiv: 2605.24588 · v1 · pith:Q5GDS6JVnew · submitted 2026-05-23 · 💻 cs.AI · cs.LG

HeartBeatAI: An Interpretable and Robust Deep Learning Framework for Multi-Label ECG Arrhythmia Detection

Pith reviewed 2026-06-30 13:56 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords ECG arrhythmia detection · deep learning · domain generalization · multi-label classification · 12-lead ECG · MixStyle regularization · SE ResNet · interpretability

The pith

HeartBeatAI reaches 98% Macro F1-score for multi-label ECG arrhythmia detection within datasets but degrades for rare anomalies under domain shift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HeartBeatAI as a deep learning framework for classifying multiple arrhythmias from 12-lead ECG recordings. It combines a Squeeze-and-Excitation ResNet to focus on diagnostic leads, a Multi-Layer Concentration Pipeline to aggregate features across scales, and regularization steps including MixStyle and label smoothing to improve generalization. Rigorous tests on four large datasets yield strong results when training and testing data come from the same source, yet Leave-One-Domain-Out evaluations expose clear drops especially for uncommon anomaly classes. The work matters to a sympathetic reader because it shows concrete performance numbers alongside evidence that cross-institution deployment remains difficult despite these techniques.

Core claim

By integrating domain generalization methods with multi-scale feature extraction and explainability components, HeartBeatAI achieves a 98% Macro F1-score in intra-source evaluations on multiple ECG datasets for multi-label arrhythmia classification. Leave-One-Domain-Out evaluations, however, show substantial degradation, particularly for infrequent anomalies, which underscores how difficult robust cross-institutional performance remains.
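Because the headline number is a Macro F1-score, it helps to see why that metric is sensitive to rare classes. The sketch below is a minimal pure-Python illustration on invented toy labels (three hypothetical classes, the third one rare). It is not the paper's evaluation code, and scoring an undefined F1 as 0.0 is a convention assumed here.

```python
from typing import List

def _counts(y_true: List[List[int]], y_pred: List[List[int]], k: int):
    """Return (tp, fp, fn) for class index k over all records."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t[k] and p[k]:
            tp += 1
        elif p[k]:
            fp += 1
        elif t[k]:
            fn += 1
    return tp, fp, fn

def _f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    # Assumed convention: an undefined F1 (no positives anywhere) counts as 0.0.
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of per-class F1: every class counts equally."""
    n_classes = len(y_true[0])
    return sum(_f1(*_counts(y_true, y_pred, k)) for k in range(n_classes)) / n_classes

def micro_f1(y_true, y_pred) -> float:
    """F1 over pooled counts: frequent classes dominate."""
    n_classes = len(y_true[0])
    totals = [sum(c) for c in zip(*(_counts(y_true, y_pred, k) for k in range(n_classes)))]
    return _f1(*totals)

# Toy multi-label data: the rare third class appears once and is missed.
Y_TRUE = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [1, 0, 1], [0, 1, 0]]
Y_PRED = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0]]

print(round(macro_f1(Y_TRUE, Y_PRED), 3), round(micro_f1(Y_TRUE, Y_PRED), 3))
```

On this toy data a single missed rare label pulls the macro average down to two thirds while the micro average stays above 0.9, which is the mechanism by which rare-anomaly failures under domain shift would show up in a Macro F1 figure.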

What carries the argument

The Squeeze-and-Excitation ResNet paired with a Multi-Layer Concentration Pipeline that isolates diagnostic leads and captures both macro-rhythm and micro-morphological anomalies.
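As a rough illustration of how a Squeeze-and-Excitation step can reweight channels, the sketch below applies the standard squeeze (global average), excitation (two small linear maps with ReLU and sigmoid), and scale steps to a tiny multi-channel signal. Treating each channel as an ECG lead, the weight values, and the function name `se_gate` are all assumptions made for illustration. This page does not describe where HeartBeatAI places its SE blocks or how they are configured.

```python
import math
from typing import List, Tuple

Signal = List[List[float]]  # [channel][time]

def se_gate(x: Signal, w1: List[List[float]], w2: List[List[float]]) -> Tuple[Signal, List[float]]:
    """Squeeze-and-Excitation over channels of a 1-D multi-channel signal.

    w1 has shape [reduced][channels]; w2 has shape [channels][reduced].
    Both are hypothetical, untrained weights used only for illustration.
    """
    # Squeeze: one summary value per channel (global average over time).
    z = [sum(ch) / len(ch) for ch in x]
    # Excitation, step 1: reduce dimensionality, then ReLU.
    h = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    # Excitation, step 2: expand back to one gate per channel, then sigmoid.
    gates = [1.0 / (1.0 + math.exp(-sum(w * hi for w, hi in zip(row, h)))) for row in w2]
    # Scale: multiply every sample of a channel by that channel's gate.
    scaled = [[g * v for v in ch] for g, ch in zip(gates, x)]
    return scaled, gates

# Three toy channels, four samples each.
X = [[0.1, 0.4, 0.2, 0.3], [1.0, 1.2, 0.9, 1.1], [-0.5, -0.4, -0.6, -0.5]]
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W2 = [[1.0, -1.0], [0.5, 0.5], [-0.7, 0.2]]

SCALED, GATES = se_gate(X, W1, W2)
print([round(g, 3) for g in GATES])
```

Each gate lies strictly between 0 and 1, so the block can only attenuate channels relative to one another; in a trained network the weights would be learned so that informative channels keep gates near 1.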

If this is right

  • The framework reliably handles simultaneous multi-label arrhythmia classification when data distributions match between training and test sets.
  • MixStyle regularization and label smoothing reduce but do not eliminate degradation on rare classes during domain-shift tests.
  • Inclusion of clinical explainability components supports potential use in medical settings.
  • LODO results indicate that further advances are needed before reliable deployment across different recording sites.
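To make the label smoothing mentioned above concrete, here is a minimal sketch of one common variant for multi-label (per-class binary) targets, where each hard 0/1 target is pulled toward 0.5 by a factor `eps`. The exact smoothing form and the value of `eps` used in the paper are not stated on this page, so both are assumptions.

```python
import math
from typing import List

def smooth_labels(y: List[int], eps: float = 0.1) -> List[float]:
    """Binary label smoothing: 1 -> 1 - eps/2, 0 -> eps/2 (assumed variant)."""
    return [t * (1.0 - eps) + 0.5 * eps for t in y]

def bce(p: float, t: float) -> float:
    """Binary cross-entropy of predicted probability p against soft target t."""
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))

hard = [1, 0, 0, 1]
soft = smooth_labels(hard, eps=0.1)
print([round(v, 3) for v in soft])

# With a smoothed target, an overconfident prediction costs more than a calibrated one.
target = soft[0]
print(bce(0.999, target) > bce(0.95, target))
```

The second print shows the regularizing effect: once the target is 0.95 rather than 1, the loss is minimized at a prediction of 0.95, so pushing the output toward 0.999 is penalized instead of rewarded.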

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training on ECG recordings drawn from a broader set of institutions could narrow the performance gap seen in LODO tests.
  • The lead-isolation and multi-scale pipeline could transfer to classification tasks on other time-series biosignals that face similar distribution shifts.
  • Detailed per-anomaly breakdowns from the LODO runs could reveal which specific rare classes drive most of the cross-domain loss.

Load-bearing premise

That the four datasets and the LODO protocol sufficiently capture real-world domain shifts between institutions and that observed performance drops stem mainly from those shifts rather than label noise or other factors.
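The Leave-One-Domain-Out protocol this premise rests on can be sketched as follows: each dataset (domain) is held out once as the test set while the model trains on the remaining ones. The domain names `A` to `D` and the record layout are hypothetical placeholders, since this page does not name the four datasets.

```python
from typing import Dict, Iterator, List, Tuple

Record = Dict[str, object]

def lodo_splits(records: List[Record]) -> Iterator[Tuple[str, List[Record], List[Record]]]:
    """Yield (held_out_domain, train_records, test_records) for each domain."""
    domains = sorted({str(r["domain"]) for r in records})
    for held_out in domains:
        train = [r for r in records if r["domain"] != held_out]
        test = [r for r in records if r["domain"] == held_out]
        yield held_out, train, test

# Hypothetical records from four placeholder domains.
RECORDS = [
    {"id": i, "domain": d}
    for i, d in enumerate(["A", "A", "B", "B", "B", "C", "D", "D"])
]

for held_out, train, test in lodo_splits(RECORDS):
    print(held_out, len(train), len(test))
```

Note that a split like this controls only which source each record came from. It does not by itself separate acquisition differences from differences in labeling practice or class prevalence, which is exactly the ambiguity the premise above points to.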

What would settle it

Running the same framework on new ECG collections from additional institutions and finding no drop in Macro F1-score for rare anomalies would challenge the claim that domain shift creates persistent cross-institutional challenges.

Original abstract

While Deep Learning (DL) enhances automated electrocardiogram (ECG) analysis, clinical deployment is hindered by class imbalance and the generalization gap. This paper presents HeartBeatAI, a deep learning framework combining domain generalization, multi-scale feature aggregation, and clinical explainability for robust 12-lead ECG classification. Moving beyond image-based paradigms, HeartBeatAI integrates a Squeeze-and-Excitation (SE) ResNet to isolate diagnostic leads alongside a Multi-Layer Concentration Pipeline to capture macro-rhythm and micro-morphological anomalies. To mitigate domain shift, the framework employs MixStyle regularization and Label Smoothing. Rigorous benchmarking across four large-scale datasets using intra-source and Leave-One-Domain-Out (LODO) protocols demonstrates high performance (98% Macro F1-score) under intra-source conditions. However, LODO evaluations reveal significant degradation in detecting rare anomalies, highlighting a persistent challenge in cross-institutional deployment.
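For readers unfamiliar with MixStyle, the sketch below shows the core operation from the MixStyle paper (Zhou et al., ICLR 2021) adapted to 1-D signals: per-sample, per-channel mean and standard deviation are treated as "style", and each sample's statistics are interpolated with those of another sample in the batch. Computing statistics over the time axis, the list-based data layout, and passing `perm` and `lam` explicitly are assumptions made to keep the example deterministic. How HeartBeatAI integrates MixStyle is not described on this page.

```python
import math
from typing import List, Tuple

Sample = List[List[float]]  # [channel][time]

def channel_stats(ch: List[float], eps: float = 1e-6) -> Tuple[float, float]:
    """Mean and (eps-stabilized) standard deviation over the time axis."""
    mu = sum(ch) / len(ch)
    var = sum((v - mu) ** 2 for v in ch) / len(ch)
    return mu, math.sqrt(var + eps)

def mixstyle(batch: List[Sample], perm: List[int], lam: float) -> List[Sample]:
    """Mix each sample's channel statistics with those of batch[perm[i]].

    lam = 1.0 keeps the original statistics; lam = 0.0 adopts the partner's.
    """
    out = []
    for i, sample in enumerate(batch):
        partner = batch[perm[i]]
        mixed = []
        for ch, ch_p in zip(sample, partner):
            mu, sig = channel_stats(ch)
            mu_p, sig_p = channel_stats(ch_p)
            mu_mix = lam * mu + (1.0 - lam) * mu_p
            sig_mix = lam * sig + (1.0 - lam) * sig_p
            # Normalize with own statistics, then re-style with the mixed ones.
            mixed.append([sig_mix * (v - mu) / sig + mu_mix for v in ch])
        out.append(mixed)
    return out

# Two single-channel toy samples with very different scale and offset.
BATCH = [[[1.0, 2.0, 3.0, 4.0]], [[10.0, 20.0, 30.0, 40.0]]]
MIXED = mixstyle(BATCH, perm=[1, 0], lam=0.5)
print([round(v, 2) for v in MIXED[0][0]])
```

In training, `lam` would typically be drawn per sample from a Beta distribution (for example `random.betavariate(alpha, alpha)` with a small `alpha`) and `perm` would be a random shuffle of the batch indices, with the operation applied only with some probability and disabled at test time.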

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces HeartBeatAI, a deep learning framework for multi-label 12-lead ECG arrhythmia detection. It combines an SE-ResNet with multi-scale feature aggregation via a Multi-Layer Concentration Pipeline, incorporates MixStyle regularization and Label Smoothing for domain generalization, and aims for clinical explainability. Benchmarking on four large-scale datasets shows 98% Macro F1-score in intra-source settings but significant degradation in Leave-One-Domain-Out (LODO) evaluations, particularly for rare anomalies, underscoring challenges in cross-institutional deployment.

Significance. If the empirical results can be verified with full methodological details, the framework could advance robust and interpretable ECG analysis for clinical use by addressing domain shift and class imbalance. However, the current presentation lacks the necessary details to assess its contribution relative to existing methods.

major comments (2)
  1. Abstract: The reported 98% Macro F1-score under intra-source conditions is presented without any baselines, statistical tests, implementation details, or error analysis, rendering the central performance claim unverifiable.
  2. LODO protocol description: The attribution of performance degradation in LODO to domain shift between institutions lacks supporting information on dataset sizes, per-class frequencies, annotation protocols, or acquisition parameters; without these, alternative explanations such as label noise or varying class imbalance cannot be ruled out.
minor comments (1)
  1. Abstract: The term 'large-scale datasets' is used without specifying the actual dataset names or sizes.
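The per-class frequency information requested in major comment 2 could be tabulated along the following lines. This is only a sketch: the record layout, the placeholder domains `A` and `B`, and the label names are invented for illustration and do not come from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, List

def class_frequencies(records: List[dict]) -> Dict[str, Dict[str, float]]:
    """Fraction of records in each domain that carry each label."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    totals: Counter = Counter()
    labels = set()
    for r in records:
        totals[r["domain"]] += 1
        for lab in r["labels"]:
            counts[r["domain"]][lab] += 1
            labels.add(lab)
    # Report every label for every domain so that absent classes show up as 0.0.
    return {
        d: {lab: counts[d][lab] / totals[d] for lab in sorted(labels)}
        for d in sorted(totals)
    }

# Hypothetical multi-label records from two placeholder domains.
RECORDS = [
    {"domain": "A", "labels": ["AF"]},
    {"domain": "A", "labels": ["AF", "LBBB"]},
    {"domain": "A", "labels": ["NSR"]},
    {"domain": "A", "labels": ["NSR"]},
    {"domain": "B", "labels": ["NSR"]},
    {"domain": "B", "labels": ["AF"]},
]

for domain, freqs in class_frequencies(RECORDS).items():
    print(domain, freqs)
```

A table of this kind makes it visible when a class that is rare in the training domains is absent or much more common in the held-out domain, which would point to prevalence mismatch rather than signal-level domain shift as a driver of the LODO drop.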

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: Abstract: The reported 98% Macro F1-score under intra-source conditions is presented without any baselines, statistical tests, implementation details, or error analysis, rendering the central performance claim unverifiable.

    Authors: We agree that the abstract's brevity limits the inclusion of supporting elements. The full manuscript (Sections 3 and 4) provides the requested details: comparisons to multiple baselines, paired statistical tests, implementation hyperparameters, and per-class error breakdowns. To improve verifiability of the headline claim, we will revise the abstract to briefly note the intra-source benchmarking protocol and to reference the state-of-the-art comparisons.

    revision: yes

  2. Referee: LODO protocol description: The attribution of performance degradation in LODO to domain shift between institutions lacks supporting information on dataset sizes, per-class frequencies, annotation protocols, or acquisition parameters; without these, alternative explanations such as label noise or varying class imbalance cannot be ruled out.

    Authors: We concur that expanded dataset characterization is needed to strengthen the domain-shift interpretation. The current manuscript references the four public datasets and their source publications but does not tabulate the requested statistics in the LODO section. We will add an explicit table (or expanded subsection) listing dataset sizes, per-class frequencies across domains, annotation sources, and acquisition parameters to allow readers to evaluate alternative explanations such as label noise or imbalance differences.

    revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with no derivations or self-referential reductions

Full rationale

The paper reports direct experimental outcomes from training and evaluating a DL model (SE-ResNet + Multi-Layer Concentration Pipeline + MixStyle + Label Smoothing) on four datasets under intra-source and LODO protocols, yielding measured metrics such as 98% Macro F1. No equations, parameter-fitting steps presented as predictions, uniqueness theorems, or self-citation chains appear in the abstract or described content. All claims are falsifiable performance statements grounded in external data splits rather than reducing to inputs by construction. The absence of any derivation chain makes circularity analysis inapplicable; the reader's assigned score of 2 reflects this lack of mathematical structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard deep learning assumptions such as the ability of neural networks to learn discriminative features from labeled ECG data and the validity of LODO as a proxy for domain shift. No new free parameters, axioms, or invented entities are introduced in the abstract.


