How Do Electrocardiogram Models Scale?
Pith reviewed 2026-05-20 13:42 UTC · model grok-4.3
The pith
Self-supervised learning enables robust scaling of ECG models with both model and data size, while ResNets are more parameter-efficient than Transformers for out-of-distribution tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pre-training more than 120 models, ranging from 20K to 200M parameters, on the CODE dataset of 2.3M records reveals that SL models are data-bottlenecked in-distribution, whereas SSL models scale robustly across both model and data sizes. For OOD generalization, ResNets are 1.3 to 2.5 times more parameter-efficient than Transformers, while SSL is up to 16 times more data-efficient and achieves up to 7.6 times higher transfer efficiency than SL on unseen clinical tasks. ResNet-based models generally achieve the lowest OOD loss, with SSL dominating on unseen clinical tasks and self-supervised Transformers overtaking at very large model sizes.
What carries the argument
The decoupling of architecture choice (ResNet vs Transformer) and pre-training paradigm (SL vs SSL) through systematic scaling experiments on the CODE dataset.
If this is right
- SSL models will continue to benefit from increases in both model size and pre-training data size for in-distribution performance.
- ResNet architectures will be 1.3 to 2.5 times more parameter-efficient than Transformers at reaching equivalent OOD generalization in ECG tasks.
- SSL pre-training will provide up to 16 times better data efficiency and up to 7.6 times higher transfer efficiency to new clinical tasks compared to SL.
- Self-supervised Transformers, which overtake ResNets at the largest model sizes tested, may extend that lead beyond the tested range.
- The most effective ECG foundation models will result from aligning architecture and pre-training paradigm rather than relying on larger scales alone.
Where Pith is reading between the lines
- Clinicians might prefer smaller ResNet-based SSL models for practical deployment because of their efficiency advantages.
- These efficiency patterns could guide scaling strategies for other biomedical time-series signals beyond ECG.
- Testing the same architectures on additional hospital datasets would check whether the reported data and transfer efficiencies generalize.
Load-bearing premise
The scaling trends and efficiency advantages observed for SSL and ResNets on the CODE dataset will persist when applied to larger models, different data sources, or additional clinical tasks.
What would settle it
Training models larger than 200M parameters or evaluating on ECG records from an unseen hospital system and finding that SSL no longer scales or that Transformers match or exceed ResNet efficiency would falsify the central claims.
Original abstract
While scaling laws have established a fundamental framework for foundation models in natural language processing, their applicability to electrocardiogram (ECG) models remains poorly characterized. Indeed, recent studies do not always yield consistent downstream gains as one increases the model size or pre-training dataset size of ECG models, leaving the exact roles of architectural inductive biases, pre-training paradigms, and expected improvements with size largely unanswered. In this work, we systematically investigate neural and loss-to-loss scaling laws within the ECG domain. By pre-training over $120$ models (ranging from $20$K to $200$M parameters) on the large-scale CODE dataset ($2.3$M records), we decouple the effects of model architecture (ResNet vs. Transformer) and pre-training paradigm, namely supervised learning (SL) versus self-supervised learning (SSL). We found that (i) SL models are data-bottlenecked in-distribution, whereas SSL models scale robustly across both model and data sizes; (ii) for out-of-distribution (OOD) generalization, ResNets are $1.3$ to $2.5$ times more parameter-efficient than Transformers, while SSL is up to $16$ times more data-efficient and achieves up to $7.6$ times higher transfer efficiency than SL on unseen clinical tasks; (iii) across the observed scales, ResNet-based models generally achieve the lowest OOD loss, with SSL dominating on unseen clinical tasks and self-supervised Transformers overtaking at very large model sizes. Our results suggest that the path to effective ECG foundation models lies in the strategic alignment of architecture and paradigm rather than brute-force scaling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper systematically studies scaling laws for ECG models by pre-training over 120 models (20K–200M parameters) on the CODE dataset (2.3M records). It decouples architecture (ResNet vs. Transformer) from pre-training paradigm (SL vs. SSL) and reports that SSL models scale robustly across model and data sizes; for OOD generalization ResNets are 1.3–2.5× more parameter-efficient than Transformers, SSL is up to 16× more data-efficient, and SSL achieves up to 7.6× higher transfer efficiency than SL on unseen clinical tasks. The authors conclude that strategic alignment of architecture and paradigm, rather than brute-force scaling, is the path to effective ECG foundation models.
Significance. If the reported scaling behaviors and efficiency ratios hold under matched optimization, the work supplies the first large-scale empirical map of neural and loss-to-loss scaling in the ECG domain. The scale of the experiment (over 120 models) and the explicit decoupling of architecture from paradigm constitute a clear advance over prior inconsistent scaling observations in the ECG literature.
major comments (3)
- [Abstract / Results] Abstract and Results: the reported efficiency multipliers (ResNets 1.3–2.5× more parameter-efficient; SSL up to 16× more data-efficient and 7.6× higher transfer efficiency) are presented without error bars, confidence intervals, or statistical significance tests. This omission makes it impossible to judge whether the gaps exceed run-to-run variability.
- [Results] Results (OOD generalization claims): the central comparative result that ResNets are more parameter-efficient than Transformers for OOD tasks assumes comparable hyperparameter optimization across architectures. The manuscript does not state whether learning-rate schedules, warmup, or regularization search budgets were matched for Transformers, which typically require distinct tuning on time-series data; unequal tuning could produce the observed 1.3–2.5× gap without reflecting intrinsic architectural differences.
- [Methods] Methods / Experimental setup: the exact functional form fitted to obtain the scaling laws and the criteria used to designate tasks as OOD are not specified. Post-hoc selection of which downstream tasks count as OOD could inflate the reported transfer-efficiency multipliers.
minor comments (1)
- [Abstract] Abstract: the terms 'neural and loss-to-loss scaling laws' are used without a brief definition or reference, which may hinder readers outside the immediate subfield.
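One way to attach the uncertainty that the first major comment asks for is a percentile bootstrap over per-seed measurements. The sketch below is illustrative only: `bootstrap_ratio_ci` is a hypothetical helper, the per-seed values are made-up placeholders rather than numbers from the paper, and reading the multiplier as a ratio of mean parameter counts at equal loss is an assumption, not the paper's stated definition.

```python
import random
import statistics


def bootstrap_ratio_ci(numer, denom, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(numer) / mean(denom).

    numer and denom are per-seed measurements; each group is resampled
    independently with replacement.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        n_sample = [rng.choice(numer) for _ in numer]
        d_sample = [rng.choice(denom) for _ in denom]
        ratios.append(statistics.fmean(n_sample) / statistics.fmean(d_sample))
    ratios.sort()
    lo = ratios[int((alpha / 2) * n_boot)]
    hi = ratios[int((1 - alpha / 2) * n_boot) - 1]
    point = statistics.fmean(numer) / statistics.fmean(denom)
    return point, lo, hi


# Placeholder data (NOT from the paper): parameters, in millions, that each
# architecture needed to reach the same OOD loss across five seeds.
transformer_params = [21.0, 19.5, 23.0, 20.2, 22.1]
resnet_params = [10.1, 11.4, 9.8, 10.9, 10.3]

point, lo, hi = bootstrap_ratio_ci(transformer_params, resnet_params)
print(f"ratio = {point:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

If the lower bound of the interval stays above 1, the gap is larger than the seed-to-seed spread in this toy setting; with only a handful of seeds the interval is coarse and should be read with caution.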
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. The comments highlight important aspects of statistical rigor, experimental fairness, and methodological transparency that will strengthen the manuscript. We address each major comment below and indicate the revisions we will implement.
-
Referee: [Abstract / Results] Abstract and Results: the reported efficiency multipliers (ResNets 1.3–2.5× more parameter-efficient; SSL up to 16× more data-efficient and 7.6× higher transfer efficiency) are presented without error bars, confidence intervals, or statistical significance tests. This omission makes it impossible to judge whether the gaps exceed run-to-run variability.
Authors: We agree that reporting variability is essential for interpreting the efficiency multipliers. In the revised manuscript we will add error bars derived from multiple independent training runs (different random seeds) or bootstrap resampling for the key ratios. We will also include statistical significance tests (e.g., paired t-tests across matched runs) to confirm that the reported gaps exceed run-to-run variability. These additions will be placed in both the abstract summary and the main Results section.
Revision: yes
-
Referee: [Results] Results (OOD generalization claims): the central comparative result that ResNets are more parameter-efficient than Transformers for OOD tasks assumes comparable hyperparameter optimization across architectures. The manuscript does not state whether learning-rate schedules, warmup, or regularization search budgets were matched for Transformers, which typically require distinct tuning on time-series data; unequal tuning could produce the observed 1.3–2.5× gap without reflecting intrinsic architectural differences.
Authors: We acknowledge the importance of documenting hyperparameter fairness. Our protocol performed architecture-specific grid searches over learning rate, warmup steps, batch size, and regularization strength, with separate budgets allocated to ResNets and Transformers to accommodate time-series characteristics. In the revision we will expand the Methods section with a table listing the search ranges, number of trials per architecture, and final selected hyperparameters. This documentation will clarify that tuning effort was matched to the extent permitted by compute limits. We remain open to additional targeted experiments if the referee recommends specific configurations.
Revision: partial
-
Referee: [Methods] Methods / Experimental setup: the exact functional form fitted to obtain the scaling laws and the criteria used to designate tasks as OOD are not specified. Post-hoc selection of which downstream tasks count as OOD could inflate the reported transfer-efficiency multipliers.
Authors: We thank the referee for noting these omissions. Scaling laws were fitted with the power-law form L(x) = a·x^(-b) + c using nonlinear least-squares optimization; we will state this functional form explicitly in the revised Methods. OOD tasks were defined as clinical prediction problems absent from pre-training with measurable distribution shifts in patient population or acquisition conditions. To address post-hoc selection concerns we will add a sensitivity analysis showing transfer-efficiency multipliers under alternative OOD groupings. Both the fitting procedure and OOD criteria will be moved into the main text with supporting details in the supplement.
Revision: yes
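The power-law form L(x) = a·x^(-b) + c named in the last response can be fitted in a few lines. The sketch below is a dependency-free stand-in, not the authors' procedure: instead of a general nonlinear least-squares solver it scans candidate values of the floor c and, for each one, solves for a and b by ordinary least squares in log-log space. The function name `fit_power_law` and the synthetic data are assumptions made for illustration.

```python
import math


def fit_power_law(xs, ys, n_grid=2000):
    """Fit L(x) = a * x**(-b) + c.

    Scans c over [0, min(ys)) and, for each candidate, fits
    log(y - c) = log(a) - b * log(x) by linear regression. The candidate
    with the smallest squared error in the original space is returned.
    """
    log_x = [math.log(x) for x in xs]
    mean_lx = sum(log_x) / len(xs)
    sxx = sum((lx - mean_lx) ** 2 for lx in log_x)
    y_min = min(ys)
    best = None
    for i in range(n_grid):
        c = y_min * i / n_grid  # candidate floor, strictly below min(ys)
        log_y = [math.log(y - c) for y in ys]
        mean_ly = sum(log_y) / len(ys)
        slope = sum((lx - mean_lx) * (ly - mean_ly)
                    for lx, ly in zip(log_x, log_y)) / sxx
        a = math.exp(mean_ly - slope * mean_lx)
        b = -slope
        sse = sum((a * x ** (-b) + c - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]


# Synthetic, noise-free losses over a 20K to 200M parameter range, generated
# from a = 5.0, b = 0.3, c = 0.2. These are placeholders, not paper values.
xs = [2e4 * 10 ** (k / 2) for k in range(9)]
ys = [5.0 * x ** (-0.3) + 0.2 for x in xs]

a_hat, b_hat, c_hat = fit_power_law(xs, ys)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}, c = {c_hat:.3f}")
```

The scan keeps the example self-contained; on noisy real measurements a proper nonlinear least-squares routine, as the authors describe, would normally be preferable, and the log-space step implicitly weights small losses more heavily.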
Circularity Check
No significant circularity in empirical scaling observations
The paper reports direct empirical measurements obtained by pre-training over 120 models (20K–200M parameters) on the CODE dataset and evaluating performance on held-out in-distribution and OOD clinical tasks. Scaling trends, parameter-efficiency ratios (ResNet vs. Transformer), data-efficiency ratios (SSL vs. SL), and transfer-efficiency numbers are observed outcomes from these experiments rather than quantities algebraically derived from equations that embed the same results. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the reported chain; the central claims remain independent of the inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- model size range
- OOD task selection
axioms (1)
- domain assumption: The CODE dataset provides a sufficiently diverse and representative sample of real-world ECG recordings for scaling-law estimation.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
L(N, D) = E + A·N^{-α} + B·D^{-β} (Eq. 1); ΔL_OOD ≈ K·(ΔL_ID)^κ (Eq. 4); ResNets 1.3-2.5× more parameter-efficient than Transformers
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
SSL models scale robustly across both model and data sizes; no saturation observed
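The joint scaling law quoted above as Eq. 1, L(N, D) = E + A·N^{-α} + B·D^{-β}, can be inverted for N to ask how many parameters two architectures need to reach the same loss at a fixed data size. The sketch below does that under loudly stated assumptions: all coefficients are invented placeholders, the helper names are hypothetical, and treating the resulting ratio as a "parameter-efficiency multiplier" is one plausible reading rather than the paper's confirmed definition. Only the data size of 2.3M records is taken from the source.

```python
def loss(n, d, E, A, alpha, B, beta):
    """Eq. 1 as quoted: L(N, D) = E + A * N**(-alpha) + B * D**(-beta)."""
    return E + A * n ** (-alpha) + B * d ** (-beta)


def params_for_loss(target, d, E, A, alpha, B, beta):
    """Invert Eq. 1 for N at fixed D.

    Returns None when the target loss is at or below the floor
    E + B * D**(-beta), which no model size can reach.
    """
    gap = target - E - B * d ** (-beta)
    if gap <= 0:
        return None
    return (A / gap) ** (1.0 / alpha)


# Placeholder coefficients (NOT fitted values from the paper).
resnet = dict(E=0.10, A=2.0, alpha=0.30, B=5.0, beta=0.35)
transformer = dict(E=0.10, A=2.4, alpha=0.30, B=5.0, beta=0.35)

d_records = 2.3e6    # CODE dataset size reported in the abstract
target_loss = 0.20   # arbitrary target chosen for illustration

n_resnet = params_for_loss(target_loss, d_records, **resnet)
n_transformer = params_for_loss(target_loss, d_records, **transformer)
ratio = n_transformer / n_resnet
print(f"ResNet N = {n_resnet:.3g}, Transformer N = {n_transformer:.3g}, "
      f"ratio = {ratio:.2f}")
```

When both architectures share the same exponent α and the same floor, the ratio reduces to (A_transformer / A_resnet)^(1/α) and does not depend on the target loss; with different exponents it would vary with the target, which is one reason a reported range such as 1.3 to 2.5 is plausible.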
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.