arxiv: 2607.01145 · v1 · pith:P24ULFJ7new · submitted 2026-07-01 · 💻 cs.LG · eess.SP

A Lightweight Self-Supervised Learning Framework for Multivariate Time Series using Hierarchical-JEPA on ECG Data

Pith reviewed 2026-07-02 15:24 UTC · model grok-4.3

classification 💻 cs.LG eess.SP
keywords self-supervised learning · ECG analysis · multivariate time series · hierarchical JEPA · ER-JEPA · ST-MEM benchmark · Vision Transformer · lightweight framework

The pith

ER-JEPA uses a two-stage hierarchical structure of concatenated JEPAs to reach state-of-the-art ECG performance on the ST-MEM benchmark after pretraining on approximately 180,000 recordings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces ER-JEPA, a self-supervised framework for multivariate time series such as 12-lead ECG recordings, aimed at settings where labeled target data is scarce. It builds representations in two stages: the first produces one representation per time interval, and the second processes those representations as a univariate time series. The two stages are realized by structurally concatenating two Joint-Embedding Predictive Architectures (JEPAs). A Vision Transformer backbone supports pretraining on roughly 180,000 unlabeled 10-second recordings. The resulting model delivers state-of-the-art results on the ST-MEM benchmark with fast computation and minimal resource usage.

Core claim

The paper claims that structurally concatenating two JEPAs into a Hierarchical JEPA encodes multiple levels of abstract representation and thereby improves prediction on complex ECG tasks. The evidence offered is state-of-the-art downstream performance on the ST-MEM benchmark after pretraining on approximately 180,000 10-second recordings.

What carries the argument

The Hierarchical JEPA (H-JEPA) formed by concatenating two Joint-Embedding Predictive Architectures in a two-stage structure that first builds interval representations and then processes them as a univariate time series.
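
As a rough illustration of the two-stage token flow described above, the NumPy sketch below traces only the tensor shapes. The sizes (12 channels, 2,500 samples, patch length 50, embedding width 64), the linear patch projection, and mean pooling as the aggregation step are all assumptions made for illustration; the paper's ViT-based modules are not reproduced here.

```python
import numpy as np

def tokenize(recording, patch_len):
    """Split a (C, T) recording into (C, N, patch_len) patches indexed by (channel, interval)."""
    C, T = recording.shape
    N = T // patch_len
    return recording[:, : N * patch_len].reshape(C, N, patch_len)

def stage_one(patches, W):
    """Stand-in for stage one: embed each patch, then summarize the C concurrent
    tokens of every interval into one token (mean pooling is an assumption)."""
    tokens = patches @ W          # (C, N, D)
    return tokens.mean(axis=0)    # (N, D): one token per time interval

def stage_two(interval_tokens):
    """Stand-in for stage two: receives a univariate sequence of N interval tokens.
    A real model would apply temporal attention here; this placeholder is the identity."""
    return interval_tokens        # (N, D)

rng = np.random.default_rng(0)
C, T, patch_len, D = 12, 2500, 50, 64            # hypothetical sizes
recording = rng.standard_normal((C, T))
W = rng.standard_normal((patch_len, D)) / np.sqrt(patch_len)

patches = tokenize(recording, patch_len)
interval_tokens = stage_one(patches, W)
sequence = stage_two(interval_tokens)
print(patches.shape, interval_tokens.shape, sequence.shape)
```

The point of the sketch is the shape change: stage one collapses the channel axis, so stage two only ever sees a single sequence of interval tokens.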

If this is right

  • The model achieves state-of-the-art downstream performance on the ST-MEM benchmark.
  • Pretraining on large unlabeled ECG datasets enables strong results despite limited labeled target data.
  • The approach computes quickly and uses minimal resources.
  • Sensitivity analysis of hierarchical representations during pretraining reveals design choices for multi-level encoding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The two-stage interval-to-series design could transfer to other multivariate signals such as EEG or industrial sensor streams.
  • Low resource demands suggest feasibility for continuous monitoring on portable medical devices.
  • The observed sensitivity of hierarchical levels may inform scaling decisions in related self-supervised time-series models.

Load-bearing premise

The two-fold hierarchical structure that concatenates two JEPAs will encode multiple levels of abstract representations and thereby produce enhanced prediction performance on complex ECG tasks.

What would settle it

A controlled experiment in which a single non-hierarchical JEPA matches or exceeds the ST-MEM benchmark scores after identical pretraining on the same approximately 180,000 recordings would falsify the necessity of the two-fold structure.

Figures

Figures reproduced from arXiv: 2607.01145 by Siwon Kim.

Figure 1. Overview of Joint-Embedding Predictive Architecture. (a) The objective of the learning is the prediction of an embedding from a compatible signal with a predictor network, guided by a (possibly latent) variable. (b) Basic example of a two-level Hierarchical JEPA. The latent-space learning process of JEPA is intrinsically suited for hierarchical composition.
Figure 2. Schematic of ER-JEPA. For a multivariate time series with C channels, the pre-embedding layer first tokenizes the recording into C × N patches indexed by (channel, interval). Next, the event reconstruction module, which consists of a channel encoder and an aggregation layer, produces a sequence of tokens representing each time interval. The event analysis module then processes the resulting sequence to cap…
Figure 3. Forward Pass of Unit Channel Sequence. Given tokens of a multivariate time series indexed by (channel, time interval), channel attention layers process the input by time interval, taking concurrent tokens of every channel as a unit sequence. Next, an aggregation layer summarizes the N_ch tokens of each time interval into a single token. These summarized tokens then form the input sequence for the temporal …
Figure 4. Example of Temporal Masks. Context and target masks for each sample are randomly generated based on a predetermined configuration. For each sample within a batch, target masks are sampled as contiguous blocks, with the total target size fixed across the batch. Next, the context mask is sampled from the remaining pool after excluding the target selection; the volume of context selection remains constant acr…
Figure 5. Pretraining Loss of the Channel and Temporal JEPA. (a) Evolution of channel and temporal loss during pretraining. Both JEPAs exhibit an early-stage drop in loss, with the temporal loss reaching a lower minimum loss of 7 × 10⁻⁴. (b) Impact of the channel JEPA on temporal loss. The plot compares the temporal loss of the complete model against baseline architectures without a concatenated JEPA structure, inc…
Figure 6. Histograms of Loss Value at Epoch 2. Under fixed pretraining configurations, 500 repeated trials yielded approximately 1% of anomalous cases, characterized by relatively high loss values at Epoch 2. This phenomenon was also observed in the rare trials that produced inferior downstream performance. The five trials with the highest total loss corresponded to cases of both channel and temporal loss.
Figure 7. Computational Efficiency Comparison. The ViT-based encoders are evaluated based on batch latency and peak GPU memory usage. (a) Performance using the native settings from the downstream classification benchmark. (b) A controlled comparison where all encoders are standardized to a unified embedding dimension of 768.
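
The temporal masking procedure in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `sample_masks`, the block count and length, and uniform sampling of the context from the remaining pool are assumptions. Blocks are kept non-overlapping so that the total target size is the same for every sample.

```python
import numpy as np

def sample_masks(n_tokens, n_blocks, block_len, n_context, rng):
    """Return (context_idx, target_idx) for one sample.

    Targets are n_blocks contiguous, non-overlapping blocks of block_len tokens.
    The context is drawn without replacement from the tokens left over.
    """
    n_target = n_blocks * block_len
    if n_target + n_context > n_tokens:
        raise ValueError("target and context selections do not fit in the sequence")
    # Sorted offsets in the free space, shifted by the blocks placed before them,
    # give block starts that never overlap (blocks may be adjacent).
    free = n_tokens - n_target
    offsets = np.sort(rng.integers(0, free + 1, size=n_blocks))
    starts = offsets + np.arange(n_blocks) * block_len
    target = np.concatenate([np.arange(s, s + block_len) for s in starts])
    # Context comes only from positions that are not targets.
    remaining = np.setdiff1d(np.arange(n_tokens), target)
    context = np.sort(rng.choice(remaining, size=n_context, replace=False))
    return context, target

rng = np.random.default_rng(0)
batch = [sample_masks(n_tokens=50, n_blocks=2, block_len=8, n_context=20, rng=rng)
         for _ in range(4)]
for context, target in batch:
    print(len(context), len(target))
```

Because every sample yields the same number of context and target indices, the index arrays can be stacked into rectangular batch tensors without padding.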
Original abstract

Data analysis in the medical domain often encounters scenarios involving a limited target dataset and a large, unannotated dataset with a general distribution. Under such circumstances, self-supervised learning (SSL) methods are highly effective for utilizing large datasets, making them a popular choice for electrocardiogram (ECG) analysis. This work presents the Event Reconstruction Joint-Embedding Predictive Architecture (ER-JEPA), a lightweight SSL framework for multivariate time series, whose name and two-fold hierarchical structure are inspired by the diagnostic approach of cardiologists. At its core, ER-JEPA features: (1) a two-stage structure that constructs representations for each time interval and subsequently processes these representations as a univariate time series, (2) the hierarchical integration of two Joint-Embedding Predictive Architectures (JEPAs), and (3) a Vision Transformer (ViT) backbone. The structural concatenation of two JEPAs categorizes the model as a Hierarchical JEPA (H-JEPA), designed to encode multiple levels of abstract representations for enhanced prediction on complex tasks. This study reports a successful application of H-JEPA to 12-lead ECG data as a multivariate time series alongside an analysis of the sensitivity of hierarchical representation during the pretraining stage. Pretrained on approximately 180,000 10-second recordings, the model achieves state-of-the-art downstream performance on the ST-MEM benchmark, with rapid computation and minimal resource usage.
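
To make the JEPA objective concrete, the sketch below computes a loss in embedding space: a predictor maps context embeddings to a guess at the target embeddings, and the loss is the squared distance between the guess and the target encoder's output. Everything specific here is an assumption for illustration: the linear encoders, mean pooling of the context, the positional embeddings, the frozen-copy target encoder, and the squared-error distance. The abstract does not specify any of these details.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 16                                  # hypothetical sequence length and width
tokens = rng.standard_normal((N, D))
pos = rng.standard_normal((N, D))              # hypothetical positional embeddings

W_ctx = rng.standard_normal((D, D)) / np.sqrt(D)    # context encoder (linear stand-in)
W_tgt = W_ctx.copy()                                # target encoder, a frozen copy (assumption)
W_pred = rng.standard_normal((D, D)) / np.sqrt(D)   # predictor (linear stand-in)

def embedding_loss(tokens, context_idx, target_idx):
    """Predict target embeddings from the context and score the prediction in latent space."""
    ctx = (tokens[context_idx] + pos[context_idx]) @ W_ctx    # (n_ctx, D)
    summary = ctx.mean(axis=0)                                # (D,)
    # The predictor is told which positions to predict via their positional embeddings.
    pred = (summary + pos[target_idx]) @ W_pred               # (n_tgt, D)
    tgt = (tokens[target_idx] + pos[target_idx]) @ W_tgt      # (n_tgt, D)
    return float(np.mean((pred - tgt) ** 2))

context_idx = np.arange(0, 20)
target_idx = np.arange(30, 46)
loss = embedding_loss(tokens, context_idx, target_idx)
print(loss)
```

In a two-level hierarchy, one such loss would be computed at each level; how the two losses are weighted or combined is not stated in the abstract and is left out of the sketch.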

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces ER-JEPA (Event Reconstruction Joint-Embedding Predictive Architecture), a lightweight self-supervised learning framework for multivariate time series on 12-lead ECG data. It employs a two-stage hierarchical structure that first builds representations for time-interval patches and then processes them as a univariate series via concatenated JEPAs with a ViT backbone. Pretrained on ~180k 10-second recordings, the model reports state-of-the-art results on the ST-MEM benchmark together with a sensitivity analysis of the hierarchy and claims of low computational cost.

Significance. If the empirical results hold, the work supplies a coherent, resource-efficient SSL method for ECG analysis that exploits large unlabeled corpora. The explicit sensitivity analysis of the hierarchical component and the reported resource metrics constitute concrete strengths for a methods contribution in this domain.

minor comments (2)
  1. [Abstract] The relationship between the proper name ER-JEPA and the category label H-JEPA is introduced but not fully disambiguated; a single sentence clarifying that ER-JEPA is an instance of H-JEPA would remove the ambiguity.
  2. [Abstract] The abstract states that the hierarchy encodes 'multiple levels of abstract representations'; the sensitivity analysis mentioned in the abstract should be cross-referenced to the specific figure or table that quantifies this contribution.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their thoughtful review and positive recommendation for minor revision. We are pleased that the significance of the empirical results, the sensitivity analysis of the hierarchical component, and the reported resource efficiency were recognized as strengths of the work.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper introduces ER-JEPA/H-JEPA as a two-stage hierarchical concatenation of JEPAs on a ViT backbone, motivated by cardiologist diagnostic analogy and applied to 12-lead ECG as multivariate time series. Pretraining occurs on ~180k recordings, with downstream SOTA reported on ST-MEM plus sensitivity analysis of the hierarchy. No equations or claims reduce the benchmark performance to fitted parameters by construction; the hierarchy is presented as an architectural extension whose contribution is tested experimentally rather than assumed tautologically. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work appear as load-bearing steps. The reported results therefore stand as independent empirical outcomes of the described pretraining procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Review performed on abstract only; therefore the ledger records only explicitly named components and standard assumptions visible in the text. No numerical free parameters are stated. The new model components are treated as invented entities without external validation.

axioms (1)
  • domain assumption: a Vision Transformer backbone is appropriate for ECG time-series patches
    Used without further justification in the abstract.
invented entities (2)
  • ER-JEPA (no independent evidence)
    purpose: Lightweight SSL framework for multivariate ECG time series
    Newly named two-stage hierarchical model introduced in the paper.
  • H-JEPA (no independent evidence)
    purpose: Hierarchical concatenation of two JEPAs for multi-level representations
    Structural element defined by the paper.

pith-pipeline@v0.9.1-grok · 5778 in / 1340 out tokens · 31799 ms · 2026-07-02T15:24:36.655171+00:00 · methodology

