arxiv: 2410.08559 · v5 · submitted 2024-10-11 · 💻 cs.LG · cs.AI

Learning General Representation of 12-Lead Electrocardiogram with a Joint-Embedding Predictive Architecture

Pith reviewed 2026-05-23 19:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords self-supervised learning · 12-lead ECG · masked modeling · joint embedding predictive architecture · diagnostic classification · feature extraction · segmentation

The pith

ECG-JEPA learns semantic 12-lead ECG representations by predicting masked tokens in latent space rather than reconstructing raw signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ECG-JEPA as a self-supervised model that performs masked modeling directly in the hidden latent space of 12-lead ECG recordings. It argues this bypasses the drawbacks of reconstructing raw waveforms, such as amplifying noise and the shortcomings of simple L2 losses. The model is pre-trained on roughly 180,000 unlabeled samples drawn from multiple public ECG collections. A specialized Cross-Pattern Attention module is added to respect the multi-lead structure. The resulting representations reach state-of-the-art results on downstream diagnostic classification, feature extraction, and segmentation tasks.

Core claim

Masked modeling in the latent space can be a powerful alternative to existing self-supervised methods in the ECG domain. ECG-JEPA learns semantic representations of ECG data by predicting in the hidden latent space, bypassing the need to reconstruct raw signals, and achieves state-of-the-art performance in various downstream tasks including diagnostic classification, feature extraction, and segmentation.

What carries the argument

ECG-JEPA, a joint-embedding predictive architecture that predicts masked representations in latent space, together with Cross-Pattern Attention (CroPA), a masked attention mechanism designed for the 12-lead structure.
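The latent-space objective can be sketched in a few lines. The figure labels on this page mention a student, a teacher, a predictor, and an L1 loss, so the sketch below compares predicted and target representations with a mean absolute error over masked patches only. The function names, the list-based vectors, and the exponential-moving-average teacher update are assumptions made for illustration (EMA teachers are common in JEPA-style training, but this page does not confirm it for ECG-JEPA).

```python
from typing import List, Sequence

Vector = List[float]


def l1_latent_loss(predicted: Sequence[Vector],
                   target: Sequence[Vector],
                   masked_idx: Sequence[int]) -> float:
    """Mean absolute error between predicted and target representations,
    averaged over the masked patches and their embedding dimensions."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predicted[i], target[i]):
            total += abs(p - t)
            count += 1
    return total / count if count else 0.0


def ema_update(teacher: Vector, student: Vector, momentum: float = 0.99) -> Vector:
    """Assumed teacher update: exponential moving average of student weights."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy example: three patches with 2-dimensional embeddings, patches 1 and 2 masked.
predicted = [[0.0, 0.0], [1.0, 2.0], [0.5, 0.5]]   # predictor outputs (hypothetical)
target = [[9.0, 9.0], [1.5, 1.0], [0.5, 1.5]]      # teacher representations (hypothetical)
loss = l1_latent_loss(predicted, target, masked_idx=[1, 2])
print(loss)
```

Patch 0 is visible, so its large mismatch does not contribute to the loss. No raw waveform appears anywhere in the objective, which is the property the core claim relies on.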

If this is right

  • Representations learned without labels can be directly transferred to diagnostic classification of cardiac conditions.
  • The same pre-trained encoder improves performance on ECG feature extraction and segmentation tasks.
  • Training on the union of open ECG datasets totaling approximately 180,000 samples produces general-purpose 12-lead representations.
  • Cross-Pattern Attention enables the model to exploit inter-lead relationships during masked prediction.
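
A masked attention pattern over a lead-by-time grid of patches can be sketched as a boolean matrix. This page does not spell out which patches CroPA lets a query attend to, so the rule below (same lead, or same time index in another lead) is an assumption based on one reading of Figure 4, and `cropa_mask` is a hypothetical helper name rather than the authors' code.

```python
from typing import List


def cropa_mask(num_leads: int, num_steps: int) -> List[List[bool]]:
    """Build an attention mask over num_leads * num_steps patches.

    Patches are flattened lead by lead, so patch index = lead * num_steps + t.
    mask[q][k] is True when query patch q may attend to key patch k.
    Assumed rule: same lead OR same time index.
    """
    n = num_leads * num_steps
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        q_lead, q_t = divmod(q, num_steps)
        for k in range(n):
            k_lead, k_t = divmod(k, num_steps)
            mask[q][k] = (q_lead == k_lead) or (q_t == k_t)
    return mask


# Toy grid: 3 leads, 4 time steps. Patch 5 is lead 1, time index 1.
mask = cropa_mask(num_leads=3, num_steps=4)
allowed = [k for k, ok in enumerate(mask[5]) if ok]
print(allowed)
```

Under this assumed rule each patch attends to `num_leads + num_steps - 1` patches instead of all `num_leads * num_steps`, which is how such a mask would encode the multi-lead structure as an inductive bias.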

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-prediction approach may transfer to other noisy physiological signals such as EEG or EMG.
  • Because raw-signal reconstruction is avoided, the method could lower memory and compute costs during pre-training on large ECG archives.
  • The learned representations might support few-shot adaptation to rare arrhythmia subtypes not seen in the original training union.

Load-bearing premise

Predicting in the latent space rather than reconstructing raw signals addresses the limitations of naive L2 loss and avoids producing unnecessary noise details common in ECG data.
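
A toy calculation illustrates why a pointwise L2 loss on raw signals can be a poor training target. The page does not say which limitations the paper has in mind; the sketch assumes two commonly discussed ones, sensitivity to small time shifts and to baseline offsets such as the wander shown in Figure 1. All signal values are hypothetical.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Pointwise mean squared error (naive L2 loss) between two signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy "beat": flat baseline with one sharp spike.
clean = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]   # same spike, one sample later
wander = [x + 0.5 for x in clean]          # same beat on a raised baseline
flat = [0.0] * len(clean)                  # no beat at all

print(mse(clean, shifted), mse(clean, wander), mse(clean, flat))
```

In this toy case the flat line, which contains no beat, is closer to the clean signal under L2 than either the shifted or the baseline-offset copy of the same beat. A loss computed on learned representations is not forced to rank signals this way, which is the intuition behind the premise above.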

What would settle it

A controlled experiment in which a reconstruction-based self-supervised baseline, trained on the same 180,000-sample union and evaluated on identical downstream splits, matches or exceeds ECG-JEPA accuracy on diagnostic classification and segmentation would falsify the claimed advantage of latent-space prediction.

Figures

Figures reproduced from arXiv: 2410.08559 by Sehun Kim.

Figure 1. 12-lead ECG with baseline wander artifact.
Figure 2. Key ECG features. Diagram labels captured with this caption: Student, Teacher, Predictor, masked patches, ECG patches, contextualized representations, masked representations, predictions, L1 loss.
Figure 3. ECG-JEPA training overview. For illustration, we use …
Figure 4. Cross-Pattern Attention (CroPA). The patch in the middle attends only to the colored patches.
Figure 5. Squares following the encoder represent the representations of ECG patches. The representations are …
Abstract

Electrocardiogram (ECG) captures the heart's electrical signals, offering valuable information for diagnosing cardiac conditions. However, the scarcity of labeled data makes it challenging to fully leverage supervised learning in the medical domain. Self-supervised learning (SSL) offers a promising solution, enabling models to learn from unlabeled data and uncover meaningful patterns. In this paper, we show that masked modeling in the latent space can be a powerful alternative to existing self-supervised methods in the ECG domain. We introduce ECG-JEPA, an SSL model for 12-lead ECG analysis that learns semantic representations of ECG data by predicting in the hidden latent space, bypassing the need to reconstruct raw signals. This approach offers several advantages in the ECG domain: (1) it avoids producing unnecessary details, such as noise, which is common in ECG; and (2) it addresses the limitations of naive L2 loss between raw signals. Another key contribution is the introduction of Cross-Pattern Attention (CroPA), a specialized masked attention mechanism tailored for 12-lead ECG data. ECG-JEPA is trained on the union of several open ECG datasets, totaling approximately 180,000 samples, and achieves state-of-the-art performance in various downstream tasks including diagnostic classification, feature extraction, and segmentation. Our code is openly available at https://github.com/sehunfromdaegu/ECG_JEPA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces ECG-JEPA, a self-supervised learning model based on the Joint-Embedding Predictive Architecture (JEPA) for 12-lead ECG signals. It performs masked prediction in the latent space rather than raw-signal reconstruction, introduces a Cross-Pattern Attention (CroPA) mechanism for multi-lead data, pretrains on a union of open datasets totaling ~180k samples, and reports state-of-the-art results on downstream tasks including diagnostic classification, feature extraction, and segmentation. The code is released openly.

Significance. If the empirical results hold under standard evaluation protocols, the work provides a useful alternative SSL approach for ECG representation learning that sidesteps issues with raw-signal L2 reconstruction. The open-source code is a clear strength that supports reproducibility and follow-up work in cardiac signal analysis.

minor comments (2)
  1. [Abstract] The abstract asserts SOTA performance without any quantitative metrics or baseline names; adding one or two key numbers (e.g., AUC or Dice improvements) would strengthen the summary.
  2. [Methods] Notation for the CroPA module and the latent-space predictor could be clarified with an explicit equation or diagram reference in the methods section to aid readers unfamiliar with JEPA variants.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report highlights the strengths of ECG-JEPA, including the latent-space prediction approach, CroPA mechanism, pretraining scale, downstream results, and open code release. No major comments are provided in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation is empirical and self-contained

The paper introduces ECG-JEPA as an application of JEPA-style latent-space masked prediction to 12-lead ECG, augmented by the new CroPA attention mechanism. Training occurs on ~180k unlabeled samples with standard SSL objectives; performance is measured on held-out downstream tasks (classification, segmentation). No equations or claims reduce a 'prediction' to a fitted input by construction, no self-citation chain bears the central result, and no ansatz is smuggled via prior author work. The approach follows standard self-supervised principles without internal definitional collapse.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of latent-space masked prediction for noisy ECG signals and on the utility of the newly introduced CroPA module. Training uses a union of public datasets whose representativeness is assumed but not independently verified in the abstract.

free parameters (1)
  • Mask ratio and other training hyperparameters
    Standard deep-learning choices whose specific values are not reported in the abstract but affect performance.
axioms (1)
  • domain assumption: Predicting representations in latent space is advantageous for ECG because it avoids reconstructing noise and sidesteps limitations of naive L2 loss on raw signals
    Explicitly listed as advantages (1) and (2) in the abstract.
invented entities (1)
  • Cross-Pattern Attention (CroPA) · no independent evidence
    purpose: Specialized masked attention mechanism tailored for 12-lead ECG data
    Introduced as a key architectural contribution; no independent evidence of effectiveness outside the paper's claimed results.

pith-pipeline@v0.9.0 · 5768 in / 1185 out tokens · 35270 ms · 2026-05-23T19:14:37.969723+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study

    eess.SP · 2026-05 · unverdicted · novelty 7.0

    Contrastive predictive coding pretraining combined with structured state space models yields the strongest ECG foundation models, with continued gains from scaling data to 11 million samples.

  2. Extending Pretrained 10-Second ECG Foundation Models to Longer Horizons

    cs.LG · 2026-05 · unverdicted · novelty 5.0

    A parameter-efficient plug-in framework adds structurally compatible long-sequence processing and semantically informed temporal modeling to extend pretrained 10-second ECG foundation models to longer variable-length inputs.

  3. ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook

    eess.SP · 2026-04 · unverdicted · novelty 3.0

    ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.
