Learning General Representation of 12-Lead Electrocardiogram with a Joint-Embedding Predictive Architecture
Pith reviewed 2026-05-23 19:14 UTC · model grok-4.3
The pith
ECG-JEPA learns semantic 12-lead ECG representations by predicting masked tokens in latent space rather than reconstructing raw signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Masked modeling in the latent space can be a powerful alternative to existing self-supervised methods in the ECG domain. ECG-JEPA learns semantic representations of ECG data by predicting in the hidden latent space, bypassing the need to reconstruct raw signals, and achieves state-of-the-art performance in various downstream tasks including diagnostic classification, feature extraction, and segmentation.
What carries the argument
ECG-JEPA, a joint-embedding predictive architecture that predicts masked representations in latent space, together with Cross-Pattern Attention (CroPA), a masked attention mechanism designed for the 12-lead structure.
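A minimal sketch of what predicting masked representations in latent space can look like, under stated assumptions. The paper passage quoted on this page mentions a student-teacher setup and a smooth L1 loss; the exponential-moving-average teacher update, the toy `encode` function, the identity "predictor", and every numeric value here are hypothetical stand-ins, not the authors' implementation.

```python
import random

def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 (Huber-style) distance between two equal-length vectors."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def ema_update(teacher, student, momentum=0.99):
    """Move teacher weights toward student weights (ASSUMED EMA rule)."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def encode(weights, patch):
    """Toy stand-in encoder: one latent value per weight, scaled by the patch mean."""
    mean = sum(patch) / len(patch)
    return [w * mean for w in weights]

def latent_loss(student_w, teacher_w, patches, masked_idx):
    """Average smooth L1 between a student prediction and teacher latents of masked patches.

    For brevity the predictor is the identity: the student encodes the average
    of the visible patches, and that latent is compared with the teacher's
    latent of each masked patch. No raw samples are reconstructed.
    """
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]
    context = [sum(col) / len(visible) for col in zip(*visible)]
    pred = encode(student_w, context)
    losses = [smooth_l1(pred, encode(teacher_w, patches[i])) for i in masked_idx]
    return sum(losses) / len(losses)

random.seed(0)
patches = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]
student = [0.5, -0.2, 0.1]
teacher = list(student)

loss = latent_loss(student, teacher, patches, masked_idx={1, 4})
# After a (hypothetical) gradient step changes the student, the teacher follows slowly.
teacher = ema_update(teacher, [0.6, -0.1, 0.2])
```

The point of the sketch is only the shape of the objective: the target is the teacher's latent for a masked patch, so the loss never compares raw waveform samples.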
If this is right
- Representations learned without labels can be directly transferred to diagnostic classification of cardiac conditions.
- The same pre-trained encoder improves performance on ECG feature extraction and segmentation tasks.
- Training on the union of open ECG datasets totaling approximately 180,000 samples produces general-purpose 12-lead representations.
- Cross-Pattern Attention enables the model to exploit inter-lead relationships during masked prediction.
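The review text does not spell out how Cross-Pattern Attention restricts attention, so the following is an illustrative sketch only. It assumes, hypothetically, that a token at (lead, time) may attend to tokens that share its lead or its time position; the function name, the lead-major token layout, and the patch count are all invented for the example.

```python
def cross_pattern_mask(n_leads, n_patches):
    """Boolean attention mask over tokens indexed lead-major: token = lead * n_patches + t.

    ASSUMPTION (not stated in the text above): a token may attend to another
    token only if the two share a lead or share a time position.
    """
    n = n_leads * n_patches
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        lead_i, time_i = divmod(i, n_patches)
        for j in range(n):
            lead_j, time_j = divmod(j, n_patches)
            mask[i][j] = (lead_i == lead_j) or (time_i == time_j)
    return mask

# 12 leads as in the paper; 5 patches per lead is a hypothetical value.
mask = cross_pattern_mask(n_leads=12, n_patches=5)
allowed_per_token = sum(mask[0])  # same-lead tokens plus same-time tokens, self counted once
```

Under this assumed rule each token sees a small cross-shaped neighbourhood rather than all 60 tokens, which is one plausible way a mask could encode the lead-by-time structure of a 12-lead recording.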
Where Pith is reading between the lines
- The same latent-prediction approach may transfer to other noisy physiological signals such as EEG or EMG.
- Because raw-signal reconstruction is avoided, the method could lower memory and compute costs during pre-training on large ECG archives.
- The learned representations might support few-shot adaptation to rare arrhythmia subtypes not seen in the original training union.
Load-bearing premise
Predicting in the latent space rather than reconstructing raw signals addresses the limitations of naive L2 loss and avoids producing unnecessary noise details common in ECG data.
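A small worked example of the premise, using a synthetic waveform rather than real ECG data. It shows only that a raw-signal L2 distance between two noisy recordings of the same underlying shape is nonzero and set by the noise level; it does not demonstrate that latent prediction removes the problem. The helper names, the Gaussian bump, and the noise level are assumptions made for illustration.

```python
import math
import random

def mse(a, b):
    """Mean squared error (the L2 loss discussed above) between two signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def toy_beat(n=200):
    """Synthetic single-peak waveform standing in for one beat (illustrative only)."""
    return [math.exp(-((i - n / 2) ** 2) / (2 * 8.0 ** 2)) for i in range(n)]

def add_noise(signal, sigma, rng):
    """Add independent Gaussian noise with standard deviation sigma to every sample."""
    return [x + rng.gauss(0.0, sigma) for x in signal]

rng = random.Random(0)
clean = toy_beat()
noisy_a = add_noise(clean, 0.1, rng)
noisy_b = add_noise(clean, 0.1, rng)

# Same morphology, different noise: the raw L2 gap is driven entirely by noise,
# with expectation 2 * sigma**2 for two independent noisy copies.
raw_gap = mse(noisy_a, noisy_b)
# The underlying shapes are identical, so their distance is exactly zero.
clean_gap = mse(clean, clean)
```

A reconstruction objective trained against `noisy_a` is rewarded for reproducing its particular noise samples, which is the behaviour the premise says latent-space targets are meant to avoid.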
What would settle it
A controlled experiment in which a reconstruction-based self-supervised baseline, trained on the same 180,000-sample union and evaluated on identical downstream splits, matches or exceeds ECG-JEPA accuracy on diagnostic classification and segmentation would falsify the claimed advantage of latent-space prediction.
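One way such a head-to-head comparison could be scored is a paired bootstrap over the shared test split. This is a sketch under stated assumptions: the function names and the prediction lists are hypothetical, accuracy stands in for whichever metric the downstream task uses, and nothing here reproduces the paper's evaluation protocol.

```python
import random

def accuracy(preds, labels):
    """Fraction of predictions that equal the label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def paired_bootstrap_diff(preds_a, preds_b, labels, n_boot=1000, seed=0):
    """Bootstrap the accuracy difference (A minus B) over one shared test set.

    Both models are scored on the same resampled indices, so the comparison
    stays paired. Returns (observed_diff, lower, upper) with a 95% percentile interval.
    """
    rng = random.Random(seed)
    n = len(labels)
    observed = accuracy(preds_a, labels) - accuracy(preds_b, labels)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(preds_a[i] == labels[i] for i in idx) / n
        acc_b = sum(preds_b[i] == labels[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return observed, lower, upper

# Hypothetical predictions; real ones would come from the two pre-trained models
# after fine-tuning on identical downstream splits.
labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
latent = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]  # stands in for the latent-prediction model
recon = [0, 1, 0, 0, 1, 1, 1, 1, 1, 0]   # stands in for the reconstruction baseline
obs, lo, hi = paired_bootstrap_diff(latent, recon, labels)
```

If the interval for the difference includes zero or sits below it, the baseline "matches or exceeds" the latent-prediction model in the sense used above; with only ten toy examples the interval is wide, which is why the test calls for the full downstream splits.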
Original abstract
Electrocardiogram (ECG) captures the heart's electrical signals, offering valuable information for diagnosing cardiac conditions. However, the scarcity of labeled data makes it challenging to fully leverage supervised learning in the medical domain. Self-supervised learning (SSL) offers a promising solution, enabling models to learn from unlabeled data and uncover meaningful patterns. In this paper, we show that masked modeling in the latent space can be a powerful alternative to existing self-supervised methods in the ECG domain. We introduce ECG-JEPA, an SSL model for 12-lead ECG analysis that learns semantic representations of ECG data by predicting in the hidden latent space, bypassing the need to reconstruct raw signals. This approach offers several advantages in the ECG domain: (1) it avoids producing unnecessary details, such as noise, which is common in ECG; and (2) it addresses the limitations of naive L2 loss between raw signals. Another key contribution is the introduction of Cross-Pattern Attention (CroPA), a specialized masked attention mechanism tailored for 12-lead ECG data. ECG-JEPA is trained on the union of several open ECG datasets, totaling approximately 180,000 samples, and achieves state-of-the-art performance in various downstream tasks including diagnostic classification, feature extraction, and segmentation. Our code is openly available at https://github.com/sehunfromdaegu/ECG_JEPA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ECG-JEPA, a self-supervised learning model based on the Joint-Embedding Predictive Architecture (JEPA) for 12-lead ECG signals. It performs masked prediction in the latent space rather than raw-signal reconstruction, introduces a Cross-Pattern Attention (CroPA) mechanism for multi-lead data, pretrains on a union of open datasets totaling ~180k samples, and reports state-of-the-art results on downstream tasks including diagnostic classification, feature extraction, and segmentation. The code is released openly.
Significance. If the empirical results hold under standard evaluation protocols, the work provides a useful alternative SSL approach for ECG representation learning that sidesteps issues with raw-signal L2 reconstruction. The open-source code is a clear strength that supports reproducibility and follow-up work in cardiac signal analysis.
minor comments (2)
- [Abstract] The abstract asserts SOTA performance without any quantitative metrics or baseline names; adding one or two key numbers (e.g., AUC or Dice improvements) would strengthen the summary.
- [Methods] Notation for the CroPA module and the latent-space predictor could be clarified with an explicit equation or diagram reference in the methods section to aid readers unfamiliar with JEPA variants.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report highlights the strengths of ECG-JEPA, including the latent-space prediction approach, CroPA mechanism, pretraining scale, downstream results, and open code release. No major comments are provided in the report.
Circularity Check
No significant circularity; derivation is empirical and self-contained
The paper introduces ECG-JEPA as an application of JEPA-style latent-space masked prediction to 12-lead ECG, augmented by the new CroPA attention mechanism. Training occurs on ~180k unlabeled samples with standard SSL objectives; performance is measured on held-out downstream tasks (classification, segmentation). No equations or claims reduce a 'prediction' to a fitted input by construction, no self-citation chain bears the central result, and no ansatz is smuggled via prior author work. The approach follows standard self-supervised principles without internal definitional collapse.
Axiom & Free-Parameter Ledger
free parameters (1)
- Mask ratio and other training hyperparameters
axioms (1)
- domain assumption: Predicting representations in latent space is advantageous for ECG because it avoids reconstructing noise and sidesteps limitations of naive L2 loss on raw signals
invented entities (1)
- Cross-Pattern Attention (CroPA): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: masked modeling in the latent space... avoids producing unnecessary details, such as noise... addresses the limitations of naïve L2 loss between raw signals
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: ECG-JEPA... transformer... CroPA... student-teacher... smooth L1 loss
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study
  Contrastive predictive coding pretraining combined with structured state space models yields the strongest ECG foundation models, with continued gains from scaling data to 11 million samples.
- Extending Pretrained 10-Second ECG Foundation Models to Longer Horizons
  A parameter-efficient plug-in framework adds structurally compatible long-sequence processing and semantically informed temporal modeling to extend pretrained 10-second ECG foundation models to longer variable-length inputs.
- ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook
  ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.