How Do Electrocardiogram Models Scale?

Antônio H. Ribeiro; Fabio Bonassi; Jiawei Li; Johan Sundström; Ming Jin; Stefan Gustafsson; Thomas B. Schön

arxiv: 2605.17276 · v1 · pith:GIFZ2KDSnew · submitted 2026-05-17 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 13:42 UTC · model grok-4.3

keywords: electrocardiogram, scaling laws, self-supervised learning, ResNet, Transformer, out-of-distribution generalization, foundation models, transfer efficiency

The pith

Self-supervised learning enables robust scaling of ECG models with both model and data size, while ResNets are more parameter-efficient than Transformers for out-of-distribution tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines scaling behaviors in electrocardiogram models by training 120 different models on a large dataset of 2.3 million ECG records. It separates the impact of choosing between ResNet and Transformer architectures from the choice between supervised and self-supervised training methods. The results indicate that self-supervised models continue to improve as they grow larger and train on more data, in contrast to supervised models that become limited by data availability. For generalizing to new clinical situations not seen during training, self-supervised learning shows much higher efficiency in using data and transferring knowledge, and ResNets require fewer parameters than Transformers to reach good performance. These insights suggest that building strong ECG models requires matching the right architecture with the right training approach.

Core claim

Pre-training 120 models from 20K to 200M parameters on the CODE dataset of 2.3M records reveals that supervised learning (SL) models are data-bottlenecked in-distribution, whereas self-supervised learning (SSL) models scale robustly across both model and data sizes. For out-of-distribution (OOD) generalization, ResNets are 1.3 to 2.5 times more parameter-efficient than Transformers, while SSL is up to 16 times more data-efficient and achieves up to 7.6 times higher transfer efficiency than SL on unseen clinical tasks. Across the observed scales, ResNet-based models generally achieve the lowest OOD loss, with SSL dominating on unseen clinical tasks and self-supervised Transformers overtaking at very large model sizes.
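The data-bottleneck claim can be made concrete with the joint scaling form this page later quotes as the paper's Eq. 1, L(N, D) = E + A·N^{-α} + B·D^{-β}. The sketch below is only an illustration: the function names and every coefficient in `COEFFS` are hypothetical, not values fitted in the paper. It shows that with the dataset size D held fixed, growing the model size N can only push the loss down to the floor E + B·D^{-β}.

```python
def joint_loss(n_params, n_records, E, A, alpha, B, beta):
    """Evaluate L(N, D) = E + A * N**(-alpha) + B * D**(-beta)."""
    return E + A * n_params ** (-alpha) + B * n_records ** (-beta)


def data_floor(n_records, E, B, beta, **_unused):
    """Loss left over as N grows without bound at a fixed dataset size."""
    return E + B * n_records ** (-beta)


# Hypothetical coefficients for illustration only; not values fitted in the paper.
COEFFS = dict(E=0.02, A=5.0, alpha=0.5, B=3.0, beta=0.3)

if __name__ == "__main__":
    n_records = 2.3e6  # size of the CODE pre-training set reported above
    floor = data_floor(n_records, **COEFFS)
    for n_params in (2e4, 2e6, 2e8):
        loss = joint_loss(n_params, n_records, **COEFFS)
        print(f"N={n_params:.0e}  loss={loss:.4f}  gap to data floor={loss - floor:.4f}")
```

Under this form, a model family is data-bottlenecked once the A·N^{-α} term is small compared with B·D^{-β}; past that point, only more data lowers the loss.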

What carries the argument

The decoupling of architecture choice (ResNet vs Transformer) and pre-training paradigm (SL vs SSL) through systematic scaling experiments on the CODE dataset.

If this is right

SSL models will continue to benefit from increases in both model size and pre-training data size for in-distribution performance.
ResNet architectures will require 1.3-2.5 times fewer parameters than Transformers to achieve equivalent OOD generalization in ECG tasks.
SSL pre-training will provide up to 16 times better data efficiency and up to 7.6 times higher transfer efficiency on new clinical tasks compared to SL.
Self-supervised Transformers overtake ResNets on OOD loss only at the largest model sizes tested.
The most effective ECG foundation models will result from aligning architecture and pre-training paradigm rather than relying on larger scales alone.
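One way to read a statement such as "1.3-2.5 times fewer parameters" is as a ratio of model sizes needed to reach the same loss. The sketch below assumes that reading, which this page does not confirm as the paper's definition, and assumes each architecture follows a fitted curve L(N) = a·N^{-b} + c. The function names and all coefficients are hypothetical.

```python
import math


def params_needed(target_loss, a, b, c):
    """Invert L(N) = a * N**(-b) + c to get the N that reaches target_loss."""
    if target_loss <= c:
        raise ValueError("target loss is at or below the fitted floor c")
    return ((target_loss - c) / a) ** (-1.0 / b)


def efficiency_ratio(target_loss, fit_reference, fit_other):
    """How many times more parameters fit_other needs than fit_reference."""
    return params_needed(target_loss, *fit_other) / params_needed(target_loss, *fit_reference)


if __name__ == "__main__":
    # Hypothetical (a, b, c) triples; not fitted values from the paper.
    resnet_fit = (1.0, 0.5, 0.0)
    transformer_fit = (math.sqrt(2.0), 0.5, 0.0)
    ratio = efficiency_ratio(0.01, resnet_fit, transformer_fit)
    print(f"Transformer needs {ratio:.2f}x the parameters of the ResNet")
```

When the two exponents b differ, the ratio depends on the target loss, which is one reason an efficiency multiplier is naturally reported as a range rather than a single number.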

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Clinicians might prefer smaller ResNet-based SSL models for practical deployment because of their efficiency advantages.
These efficiency patterns could guide scaling strategies for other biomedical time-series signals beyond ECG.
Testing the same architectures on additional hospital datasets would check whether the reported data and transfer efficiencies generalize.

Load-bearing premise

The scaling trends and efficiency advantages observed for SSL and ResNets on the CODE dataset will persist when applied to larger models, different data sources, or additional clinical tasks.

What would settle it

Training models larger than 200M parameters or evaluating on ECG records from an unseen hospital system and finding that SSL no longer scales or that Transformers match or exceed ResNet efficiency would falsify the central claims.

Figures

Figures reproduced from arXiv: 2605.17276 by Antônio H. Ribeiro, Fabio Bonassi, Jiawei Li, Johan Sundström, Ming Jin, Stefan Gustafsson, Thomas B. Schön.

**Figure 2.** ID and OOD results of Transformer-SL, Transformer-SSL, ResNet-SL, and ResNet-SSL.

**Figure 3.** Loss-to-loss scaling curves for the four architecture-paradigm combinations. The dashed ...

**Figure 4.** Scaling behavior for four architectural paradigms. Top and bottom rows illustrate OOD ...

**Figure 5.** OOD evaluation on CPSC2018. (a) Empirical frontier of test loss versus pre-training FLOPs. ...

**Figure 6.** (a) Label scaling benefits ID performance, but OOD transfer depends on label selection ...

Original abstract

While scaling laws have established a fundamental framework for foundation models in natural language processing, their applicability to electrocardiogram (ECG) models remains poorly characterized. Indeed, recent studies do not always yield consistent downstream gains as one increases the model size or pre-training dataset size of ECG models, leaving the exact roles of architectural inductive biases, pre-training paradigms, and expected improvements with size largely unanswered. In this work, we systematically investigate neural and loss-to-loss scaling laws within the ECG domain. By pre-training over $120$ models (ranging from $20$K to $200$M parameters) on the large-scale CODE dataset ($2.3$M records), we decouple the effects of model architecture (ResNet vs. Transformer) and pre-training paradigm, namely supervised learning (SL) versus self-supervised learning (SSL). We found that (i) SL models are data-bottlenecked in-distribution, whereas SSL models scale robustly across both model and data sizes; (ii) for out-of-distribution (OOD) generalization, ResNets are $1.3$ to $2.5$ times more parameter-efficient than Transformers, while SSL is up to $16$ times more data-efficient and achieves up to $7.6$ times higher transfer efficiency than SL on unseen clinical tasks; (iii) across the observed scales, ResNet-based models generally achieve the lowest OOD loss, with SSL dominating on unseen clinical tasks and self-supervised Transformers overtaking at very large model sizes. Our results suggest that the path to effective ECG foundation models lies in the strategic alignment of architecture and paradigm rather than brute-force scaling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SSL scales more reliably than SL on ECG data and ResNets hold a modest parameter-efficiency edge for OOD tasks, but the size of those edges depends on whether hyperparameter budgets were truly matched.

The main things to know are that self-supervised pre-training avoids the data bottlenecks seen in supervised ECG models and that ResNets come out 1.3-2.5 times more parameter-efficient than Transformers on out-of-distribution clinical tasks, with SSL also showing data-efficiency gains of up to 16× and transfer-efficiency gains of up to 7.6×. The study ran 120 models from 20K to 200M parameters on the 2.3M-record CODE dataset and measured both in-distribution scaling and transfer to held-out tasks.

That scale and the explicit decoupling of architecture from pre-training paradigm are the clearest contributions. The results give usable numbers for anyone deciding between ResNet and Transformer backbones or between SL and SSL when compute is limited. The trends look consistent within the tested range, and the OOD focus makes the efficiency claims more relevant than pure in-distribution scaling curves.

The main soft spot is whether the hyperparameter search was equally thorough for both architectures. Transformers on signal data often need different learning-rate schedules or regularization, and if the search budget or space was not matched the reported gaps could shrink. The abstract does not mention error bars or the exact fitting procedure for the scaling relations, so those details will matter for readers who want to treat the multipliers as reliable.

This paper is aimed at groups building or benchmarking ECG foundation models who need guidance on architecture and pre-training choices rather than pure theory. It is solid enough empirical work to deserve a serious referee, mainly for the scale of the controlled comparison and the practical questions it addresses. I would send it for review with requests for the hyperparameter protocol and any statistical support for the efficiency ratios.

Referee Report

3 major / 1 minor

Summary. The paper systematically studies scaling laws for ECG models by pre-training 120 models (20K–200M parameters) on the CODE dataset (2.3M records). It decouples architecture (ResNet vs. Transformer) from pre-training paradigm (SL vs. SSL) and reports that SSL models scale robustly across model and data sizes; for OOD generalization ResNets are 1.3–2.5× more parameter-efficient than Transformers, SSL is up to 16× more data-efficient, and SSL achieves up to 7.6× higher transfer efficiency than SL on unseen clinical tasks. The authors conclude that strategic alignment of architecture and paradigm, rather than brute-force scaling, is the path to effective ECG foundation models.

Significance. If the reported scaling behaviors and efficiency ratios hold under matched optimization, the work supplies the first large-scale empirical map of neural and loss-to-loss scaling in the ECG domain. The scale of the experiment (120 models) and the explicit decoupling of architecture from paradigm constitute a clear advance over prior inconsistent scaling observations in ECG literature.

major comments (3)

[Abstract / Results] Abstract and Results: the reported efficiency multipliers (ResNets 1.3–2.5× more parameter-efficient; SSL up to 16× more data-efficient and 7.6× higher transfer efficiency) are presented without error bars, confidence intervals, or statistical significance tests. This omission makes it impossible to judge whether the gaps exceed run-to-run variability.
[Results] Results (OOD generalization claims): the central comparative result that ResNets are more parameter-efficient than Transformers for OOD tasks assumes comparable hyperparameter optimization across architectures. The manuscript does not state whether learning-rate schedules, warmup, or regularization search budgets were matched for Transformers, which typically require distinct tuning on time-series data; unequal tuning could produce the observed 1.3–2.5× gap without reflecting intrinsic architectural differences.
[Methods] Methods / Experimental setup: the exact functional form fitted to obtain the scaling laws and the criteria used to designate tasks as OOD are not specified. Post-hoc selection of which downstream tasks count as OOD could inflate the reported transfer-efficiency multipliers.

minor comments (1)

[Abstract] Abstract: the terms 'neural and loss-to-loss scaling laws' are used without a brief definition or reference, which may hinder readers outside the immediate subfield.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. The comments highlight important aspects of statistical rigor, experimental fairness, and methodological transparency that will strengthen the manuscript. We address each major comment below and indicate the revisions we will implement.

Referee: [Abstract / Results] Abstract and Results: the reported efficiency multipliers (ResNets 1.3–2.5× more parameter-efficient; SSL up to 16× more data-efficient and 7.6× higher transfer efficiency) are presented without error bars, confidence intervals, or statistical significance tests. This omission makes it impossible to judge whether the gaps exceed run-to-run variability.

Authors: We agree that reporting variability is essential for interpreting the efficiency multipliers. In the revised manuscript we will add error bars derived from multiple independent training runs (different random seeds) or bootstrap resampling for the key ratios. We will also include statistical significance tests (e.g., paired t-tests across matched runs) to confirm that the reported gaps exceed run-to-run variability. These additions will be placed in both the abstract summary and the main Results section.

Revision: yes

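The bootstrap resampling mentioned in this response could look roughly like the percentile bootstrap below. This is a sketch under stated assumptions, not the authors' procedure: `bootstrap_ratio_ci` is a hypothetical helper, the two samples are resampled independently (unpaired), and the per-seed numbers are made up for illustration.

```python
import random
import statistics


def bootstrap_ratio_ci(numer, denom, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for mean(numer) / mean(denom)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    ratios = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the original sample sizes.
        num = [rng.choice(numer) for _ in numer]
        den = [rng.choice(denom) for _ in denom]
        ratios.append(statistics.fmean(num) / statistics.fmean(den))
    ratios.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = max(lo_idx, int((1 + level) / 2 * n_boot) - 1)
    return ratios[lo_idx], ratios[hi_idx]


if __name__ == "__main__":
    # Made-up per-seed estimates (e.g. parameters needed to hit a target loss,
    # in arbitrary units) for two hypothetical model families.
    family_a = [1.90, 2.00, 2.10, 2.05, 1.95]
    family_b = [0.95, 1.00, 1.05, 1.02, 0.98]
    low, high = bootstrap_ratio_ci(family_a, family_b)
    print(f"ratio of means, 95% bootstrap interval: [{low:.2f}, {high:.2f}]")
```

If the interval for a ratio excludes 1, the gap is larger than what resampling the available seeds can explain. With only a handful of seeds the interval is itself rough, so it complements the paired tests the authors mention instead of replacing them.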
Referee: [Results] Results (OOD generalization claims): the central comparative result that ResNets are more parameter-efficient than Transformers for OOD tasks assumes comparable hyperparameter optimization across architectures. The manuscript does not state whether learning-rate schedules, warmup, or regularization search budgets were matched for Transformers, which typically require distinct tuning on time-series data; unequal tuning could produce the observed 1.3–2.5× gap without reflecting intrinsic architectural differences.

Authors: We acknowledge the importance of documenting hyperparameter fairness. Our protocol performed architecture-specific grid searches over learning rate, warmup steps, batch size, and regularization strength, with separate budgets allocated to ResNets and Transformers to accommodate time-series characteristics. In the revision we will expand the Methods section with a table listing the search ranges, number of trials per architecture, and final selected hyperparameters. This documentation will clarify that tuning effort was matched to the extent permitted by compute limits. We remain open to additional targeted experiments if the referee recommends specific configurations.

Revision: partial

Referee: [Methods] Methods / Experimental setup: the exact functional form fitted to obtain the scaling laws and the criteria used to designate tasks as OOD are not specified. Post-hoc selection of which downstream tasks count as OOD could inflate the reported transfer-efficiency multipliers.

Authors: We thank the referee for noting these omissions. Scaling laws were fitted with the power-law form L(x) = a·x^(-b) + c using nonlinear least-squares optimization; we will state this functional form explicitly in the revised Methods. OOD tasks were defined as clinical prediction problems absent from pre-training with measurable distribution shifts in patient population or acquisition conditions. To address post-hoc selection concerns we will add a sensitivity analysis showing transfer-efficiency multipliers under alternative OOD groupings. Both the fitting procedure and OOD criteria will be moved into the main text with supporting details in the supplement.

Revision: yes
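The fit named in this response, L(x) = a·x^(-b) + c, can be sketched in plain Python. The response mentions nonlinear least squares; the version below swaps in a simpler stand-in, a grid search over the floor c with a and b taken from a log-log linear regression at each candidate. It assumes c is non-negative and below the smallest observed loss. `fit_power_law` and the synthetic data are hypothetical.

```python
import math


def _linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_power_law(xs, ys, n_grid=200):
    """Fit L(x) = a * x**(-b) + c by scanning candidate floors c in [0, min(ys))."""
    y_min = min(ys)
    log_x = [math.log(x) for x in xs]
    best = None
    for i in range(n_grid):
        c = y_min * i / n_grid  # strictly below y_min, so log(y - c) is defined
        slope, intercept = _linear_fit(log_x, [math.log(y - c) for y in ys])
        a, b = math.exp(intercept), -slope
        # Score each candidate by squared error in the original loss space.
        sse = sum((a * x ** (-b) + c - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


if __name__ == "__main__":
    # Synthetic, noise-free data generated from a = 2.0, b = 0.3, c = 0.05.
    sizes = [1e4, 1e5, 1e6, 1e7, 1e8]
    losses = [2.0 * x ** (-0.3) + 0.05 for x in sizes]
    a, b, c = fit_power_law(sizes, losses)
    print(f"a={a:.3f}  b={b:.3f}  c={c:.4f}")
```

The grid resolution limits how precisely c is recovered, so the estimates of a and b inherit a small bias. A proper nonlinear least-squares routine, as the authors describe, avoids this at the cost of needing initial guesses.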

Circularity Check

0 steps flagged

No significant circularity in empirical scaling observations

The paper reports direct empirical measurements obtained by pre-training 120 models (20K–200M parameters) on the CODE dataset and evaluating performance on held-out in-distribution and OOD clinical tasks. Scaling trends, parameter-efficiency ratios (ResNet vs. Transformer), data-efficiency ratios (SSL vs. SL), and transfer-efficiency numbers are observed outcomes from these experiments rather than quantities algebraically derived from equations that embed the same results. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the reported chain; the central claims remain independent of the inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the CODE dataset distribution and the selected OOD clinical tasks are representative of real-world ECG deployment; no new physical entities or mathematical axioms are introduced.

free parameters (2)

model size range: Models from 20K to 200M parameters were chosen; the exact parameterization schedule is a modeling choice.
OOD task selection: Which clinical tasks count as out-of-distribution is defined by the authors and affects the reported transfer-efficiency multipliers.

axioms (1)

domain assumption: The CODE dataset provides a sufficiently diverse and representative sample of real-world ECG recordings for scaling-law estimation. All scaling and efficiency conclusions are conditioned on performance measured within and across splits of this single large dataset.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: L(N, D) = E + A·N^{-α} + B·D^{-β} (Eq. 1); ΔL_OOD ≈ K·(ΔL_ID)^κ (Eq. 4); ResNets 1.3-2.5× more parameter-efficient than Transformers

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: SSL models scale robustly across both model and data sizes; no saturation observed
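The loss-to-loss relation quoted above as Eq. 4, ΔL_OOD ≈ K·(ΔL_ID)^κ, becomes a straight line in log-log space, so K and κ can be estimated by ordinary linear regression. The sketch below assumes each ΔL is a strictly positive excess loss, which this page does not spell out. `fit_loss_to_loss` and the synthetic points are hypothetical.

```python
import math


def fit_loss_to_loss(delta_id, delta_ood):
    """Estimate K and kappa in delta_ood = K * delta_id**kappa via log-log regression."""
    log_x = [math.log(v) for v in delta_id]
    log_y = [math.log(v) for v in delta_ood]
    n = len(log_x)
    mean_x, mean_y = sum(log_x) / n, sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    kappa = sxy / sxx                           # slope of the log-log line
    big_k = math.exp(mean_y - kappa * mean_x)   # exp of the intercept
    return big_k, kappa


if __name__ == "__main__":
    # Synthetic, noise-free points generated with K = 1.5 and kappa = 0.8.
    delta_id = [0.20, 0.10, 0.05, 0.02, 0.01]
    delta_ood = [1.5 * d ** 0.8 for d in delta_id]
    big_k, kappa = fit_loss_to_loss(delta_id, delta_ood)
    print(f"K={big_k:.3f}  kappa={kappa:.3f}")
```

A κ below 1 would mean that OOD excess loss shrinks more slowly than ID excess loss as pre-training improves, which is the kind of comparison a loss-to-loss curve is meant to expose.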

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 7 internal anchors

[1]

Code-ii: A large-scale dataset for artificial intelligence in ecg analysis.arXiv preprint arXiv:2511.15632,

Petrus EOGB Abreu, Gabriela MM Paixão, Jiawei Li, Paulo R Gomes, Peter W Macfarlane, Ana Oliveira, Vinicius T Carvalho, Thomas B Schön, Antonio Luiz P Ribeiro, and Antônio H Ribeiro. Code-ii: A large-scale dataset for artificial intelligence in ecg analysis.arXiv preprint arXiv:2511.15632,

work page arXiv
[2]

OWLS: Scaling laws for multilingual speech recognition and translation models.arXiv preprint arXiv:2502.10373,

William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Huck Yang, and Shinji Watanabe. OWLS: Scaling laws for multilingual speech recognition and translation models.arXiv preprint arXiv:2502.10373,

work page arXiv
[3]

HuBERT-ECG as a self-supervised foundation model for broad and scalable cardiac applications.medRxiv, pp

Edoardo Coppola, Mattia Savardi, Mauro Massussi, Marianna Adamo, Marco Metra, and Alberto Signoroni. HuBERT-ECG as a self-supervised foundation model for broad and scalable cardiac applications.medRxiv, pp. 2024–11,

work page 2024
[4]

Version 1.1.0

doi:10.13026/3ykd-bf14. Version 1.1.0. FAIR. fvcore: A library of core computer vision components,

work page doi:10.13026/3ykd-bf14
[5]

Version 1.0

doi:10.13026/4nqg-sb35. Version 1.0. Xiao Gu, Wei Tang, Jinpei Han, Veer Sangha, Fenglin Liu, Shreyank N Gowda, Antonio H Ribeiro, Patrick Schwab, Kim Branson, Lei Clifton, et al. Cardiac health assessment across scenarios and devices using a multimodal foundation model pretrained on data from 1.7 million individuals. Nature Machine Intelligence, 8(2):220–233,

work page doi:10.13026/4nqg-sb35
[6]

Scaling Laws for Transfer

Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer. arXiv preprint arXiv:2102.01293,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 10,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

InceptionTime: Finding AlexNet for time series classification.Data Mining and Knowledge Discovery, 34(6):1936–1962,

Hassan Ismail Fawaz, Benjamin Lucas, Germain Forestier, Charlotte Pelletier, Daniel F Schmidt, Jonathan Weber, Geoffrey I Webb, Lhassane Idoumghar, Pierre-Alain Muller, and François Petit- jean. InceptionTime: Finding AlexNet for time series classification.Data Mining and Knowledge Discovery, 34(6):1936–1962,

work page 1936
[9]

Spencer L James, Degu Abate, Kalkidan Hassen Abate, Solomon M Abay, Cristiana Abbafati, Nooshin Abbasi, Hedayat Abbastabar, Foad Abd-Allah, Jemal Abdela, Ahmed Abdelalim, et al. Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990–2017: a systematic anal...

work page 1990
[10]

Reading Your Heart: Learning ECG words and sentences via pre-training ECG language model.arXiv preprint arXiv:2502.10707,

Jiarui Jin, Haoyu Wang, Hongyan Li, Jun Li, Jiahui Pan, and Shenda Hong. Reading Your Heart: Learning ECG words and sentences via pre-training ECG language model.arXiv preprint arXiv:2502.10707,

work page arXiv
[11]

Scaling Laws for Neural Language Models

doi:10.13026/kpb9-mt58. Version 3.1. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.13026/kpb9-mt58 2001
[12]

Reformer: The Efficient Transformer

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The Efficient Transformer.arXiv preprint arXiv:2001.04451,

work page internal anchor Pith review Pith/arXiv arXiv 2001
[13]

An electrocardiogram foundation model built on over 10 million recordings with external evaluation across multiple domains.arXiv preprint arXiv:2410.04133,

Jun Li, Aaron Aguirre, Junior Moura, Che Liu, Lanhai Zhong, Chenxi Sun, Gari Clifford, Brandon Westover, and Shenda Hong. An electrocardiogram foundation model built on over 10 million recordings with external evaluation across multiple domains.arXiv preprint arXiv:2410.04133,

work page arXiv
[14]

AnyECG: Evolved ECG foundation model for holistic health profiling.arXiv preprint arXiv:2601.10748,

11 Jun Li, Hongling Zhu, Yujie Xiao, Qinghao Zhao, Yalei Ke, Gongzheng Tang, Guangkun Nie, Deyun Zhang, Jin Li, Canqing Yu, et al. AnyECG: Evolved ECG foundation model for holistic health profiling.arXiv preprint arXiv:2601.10748,

work page arXiv
[15]

Zero-Shot ECG classification with multimodal learning and test-time clinical knowledge enhancement.arXiv preprint arXiv:2403.06659,

Che Liu, Zhongwei Wan, Cheng Ouyang, Anand Shah, Wenjia Bai, and Rossella Arcucci. Zero-Shot ECG classification with multimodal learning and test-time clinical knowledge enhancement.arXiv preprint arXiv:2403.06659,

work page arXiv
[16]

Minhao Liu, Ailing Zeng, Muxi Chen, Zhijian Xu, Qiuxia Lai, Lingna Ma, and Qiang Xu

doi:10.1166/jmihi.2018.2442. Minhao Liu, Ailing Zeng, Muxi Chen, Zhijian Xu, Qiuxia Lai, Lingna Ma, and Qiang Xu. SCINet: Time series modeling and forecasting with sample convolution and interaction.Advances in Neural Information Processing Systems, 35:5816–5828,

work page doi:10.1166/jmihi.2018.2442 2018
[17]

iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting.arXiv preprint arXiv:2310.06625,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Benchecg and xecg: a benchmark and baseline for ecg foundation models

Riccardo Lunelli, Angus Nicolson, Samuel Martin Pröll, Sebastian Johannes Reinstadler, Axel Bauer, and Clemens Dlaska. BenchECG and xECG: a benchmark and baseline for ECG foundation models.arXiv preprint arXiv:2509.10151,

work page arXiv
[19]

ECG-FM: An open electrocardiogram foundation model.arXiv preprint arXiv:2408.05178,

Kaden McKeen, Sameer Masood, Augustin Toma, Barry Rubin, and Bo Wang. ECG-FM: An open electrocardiogram foundation model.arXiv preprint arXiv:2408.05178,

work page arXiv
[20]

A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers.arXiv preprint arXiv:2211.14730,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Nils Strodthoff, Patrick Wagner, Tobias Schaeffter, and Wojciech Samek

doi:10.17044/scilifelab.15169716.v1. Nils Strodthoff, Patrick Wagner, Tobias Schaeffter, and Wojciech Samek. Deep learning for ECG analysis: Benchmarks and insights from PTB-XL.IEEE journal of biomedical and health informatics, 25(5):1519–1528,

work page doi:10.17044/scilifelab.15169716.v1
[22]

Scaling laws vs model architectures: How does inductive bias influence scaling? InFindings of the Association for Computational Linguistics: EMNLP 2023, pp

Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Tran, Dani Yogatama, and Donald Metzler. Scaling laws vs model architectures: How does inductive bias influence scaling? InFindings of the Association for Computational Linguistics: EMNLP 2023, pp. 12342–12364,

work page 2023
[23]

12 Huiqiang Wang, Jian Peng, Feihu Huang, Jince Wang, Junhui Chen, and Yifei Xiao

doi:10.1038/s41597-020-0495-6. 12 Huiqiang Wang, Jian Peng, Feihu Huang, Jince Wang, Junhui Chen, and Yifei Xiao. Micn: Multi-scale local and global context modeling for long-term series forecasting. InInternational Conference on Learning Representations,

work page doi:10.1038/s41597-020-0495-6
[24]

TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis

Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. TimesNet: Temporal 2D-variation modeling for general time series analysis.arXiv preprint arXiv:2210.02186,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Towards neural scaling laws for time series foundation models.arXiv preprint arXiv:2410.12360,

Qingren Yao, Chao-Han Huck Yang, Renhe Jiang, Yuxuan Liang, Ming Jin, and Shirui Pan. Towards neural scaling laws for time series foundation models.arXiv preprint arXiv:2410.12360,

work page arXiv
[26]

Version 1.0.0

doi:10.13026/wgex-er52. Version 1.0.0. Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. InProceedings of the AAAI conference on artificial intelligence, volume 35, pp. 11106–11115,

work page doi:10.13026/wgex-er52
[27]

provides annotations for six representative rhythm and conduction abnormalities, including first-degree atrioventricular block (1dA Vb,1.5%), right bundle branch block (RBBB, 2.7%), left bundle branch block (LBBB, 1.7%), sinus bradycardia (SB, 1.6%), atrial fibrillation (AF,1.8%), and sinus tachycardia (ST, 2.1%). We do not pretrain on the Harvard–Emory E...

work page 2026
[28]

For the SSL, we use the off-the-shelf HeartLang codebook (Jin et al., 2025), which was pre-trained on MIMIC-IV ( 0.8M patients), and the random masking rate is 50%

suggests that, under comparable training setups, encoder-only and decoder- only Transformers exhibit similar parameter and data scaling behaviors. For the SSL, we use the off-the-shelf HeartLang codebook (Jin et al., 2025), which was pre-trained on MIMIC-IV ( 0.8M patients), and the random masking rate is 50%. Besides, it should be noted that the dataset ...

work page 2025
[29]

ii Table S.2: Dataset-specific data splits, preprocessing procedures, and code availability

40,180 40,180 Two binary labels are included to indicate intensive care unit (ICU) admission: overall ICU stay and ICU admission within 24 hours. ii Table S.2: Dataset-specific data splits, preprocessing procedures, and code availability. All datasets are resampled to 100 Hz to align with the HeartLang codebook. Dataset name Train / Validation / Test spli...

work page 2020
[30]

This classifier is optimized via the Adam algorithm, configured with a learning rate of10−3 and cross-entropy loss

Evaluation setup.For the evaluation, we apply a linear classifier atop the frozen latent representa- tions. This classifier is optimized via the Adam algorithm, configured with a learning rate of10−3 and cross-entropy loss. The linear classifier is trained on the training split of each OOD dataset and validated on the corresponding validation split for a ...

work page 2024
[31]

The evaluation was conducted on either a single NVIDIA A40 GPU (48GB) or an A100 GPU (40GB), depending on the model size

demonstrates that the fine-tuning performance of large models is sensitive to the choice of learning rates and freezing strategies. The evaluation was conducted on either a single NVIDIA A40 GPU (48GB) or an A100 GPU (40GB), depending on the model size. B Supplementary Results B.1 Parameter scaling and data scaling Figure S.2 summarizes the fitted paramet...

work page arXiv 1999
[32]

Conversely, ResNet-SL maintains robust scaling at larger model size, outperforming ResNet-SSL

In multi-label, long-tailed learning regimes, Transformer-SL exhibits parameter scaling bottlenecks, significantly underperforming its SSL counterpart. Conversely, ResNet-SL maintains robust scaling at larger model size, outperforming ResNet-SSL. B.6 Quantifying overfitting from finite data As shown in Figure S.7 and Figure S.8, the loss trajectories for ...

work page 2019
[33]

We plot O against the characteristic variable X=D·(L(N,∞)) 1/β

Panel (c) illustrates the verification of theextent of overfitting, defined as O= (L(N, D)−L(N,∞))/L(N,∞) . We plot O against the characteristic variable X=D·(L(N,∞)) 1/β. The experimental results (blue scatters) align closely with the analytical power-law curves (red dashed lines)O=B·X −β. 104 105 Step 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050Loss ...

work page 2024
[34] Overall, foundation models consistently outperform from-scratch baselines across both architecture families. The best foundation model, ECG-FM, achieves a mean AUROC of 0.882, substantially exceeding the best from-scratch CNN (InceptionTime, 0.812) and the best from-scratch Transformer (PatchTST, 0.784). These results confirm that large-scale pre-training ...

[35] 0.906 0.755 0.876 0.736 0.799 0.947 0.837 39.1M 0.8M SSL
Table S.10: Macro AUROC and macro F1 scores on PTB-XL benchmarks. Best and second-best are computed within each architecture group. Fine-tuned models are highlighted in blue.
Method PTBXL (Diag.) PTBXL (Sub) PTBXL (Sup) PTBXL (Form) PTBXL (Rhythm) macro-AUC (avg) AUC F1 AUC F1 AUC F1 AUC F1 AUC...

[36] contains a larger set of labels than CODE, it has not yet been publicly released. Furthermore, this study considers two mainstream ECG models, ResNet and Transformer, as they have proven effective and are widely used as foundation architectures. A limitation is that other architectures, such as state-space models, were not systematically evaluated. Becaus...

[1] Petrus EOGB Abreu, Gabriela MM Paixão, Jiawei Li, Paulo R Gomes, Peter W Macfarlane, Ana Oliveira, Vinicius T Carvalho, Thomas B Schön, Antonio Luiz P Ribeiro, and Antônio H Ribeiro. CODE-II: A large-scale dataset for artificial intelligence in ECG analysis. arXiv preprint arXiv:2511.15632.

[2] William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Huck Yang, and Shinji Watanabe. OWLS: Scaling laws for multilingual speech recognition and translation models. arXiv preprint arXiv:2502.10373.

[3] Edoardo Coppola, Mattia Savardi, Mauro Massussi, Marianna Adamo, Marco Metra, and Alberto Signoroni. HuBERT-ECG as a self-supervised foundation model for broad and scalable cardiac applications. medRxiv, pp. 2024–11.

[4] doi:10.13026/3ykd-bf14. Version 1.1.0. FAIR. fvcore: A library of core computer vision components.

[5] doi:10.13026/4nqg-sb35. Version 1.0. Xiao Gu, Wei Tang, Jinpei Han, Veer Sangha, Fenglin Liu, Shreyank N Gowda, Antonio H Ribeiro, Patrick Schwab, Kim Branson, Lei Clifton, et al. Cardiac health assessment across scenarios and devices using a multimodal foundation model pretrained on data from 1.7 million individuals. Nature Machine Intelligence, 8(2):220–233.

[6] Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer. arXiv preprint arXiv:2102.01293.

[7] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 10.

[8] Hassan Ismail Fawaz, Benjamin Lucas, Germain Forestier, Charlotte Pelletier, Daniel F Schmidt, Jonathan Weber, Geoffrey I Webb, Lhassane Idoumghar, Pierre-Alain Muller, and François Petitjean. InceptionTime: Finding AlexNet for time series classification. Data Mining and Knowledge Discovery, 34(6):1936–1962.

[9] Spencer L James, Degu Abate, Kalkidan Hassen Abate, Solomon M Abay, Cristiana Abbafati, Nooshin Abbasi, Hedayat Abbastabar, Foad Abd-Allah, Jemal Abdela, Ahmed Abdelalim, et al. Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990–2017: a systematic anal...

[10] Jiarui Jin, Haoyu Wang, Hongyan Li, Jun Li, Jiahui Pan, and Shenda Hong. Reading Your Heart: Learning ECG words and sentences via pre-training ECG language model. arXiv preprint arXiv:2502.10707.

[11] doi:10.13026/kpb9-mt58. Version 3.1. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

[12] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The Efficient Transformer. arXiv preprint arXiv:2001.04451.

[13] Jun Li, Aaron Aguirre, Junior Moura, Che Liu, Lanhai Zhong, Chenxi Sun, Gari Clifford, Brandon Westover, and Shenda Hong. An electrocardiogram foundation model built on over 10 million recordings with external evaluation across multiple domains. arXiv preprint arXiv:2410.04133.

[14] Jun Li, Hongling Zhu, Yujie Xiao, Qinghao Zhao, Yalei Ke, Gongzheng Tang, Guangkun Nie, Deyun Zhang, Jin Li, Canqing Yu, et al. AnyECG: Evolved ECG foundation model for holistic health profiling. arXiv preprint arXiv:2601.10748.

[15] Che Liu, Zhongwei Wan, Cheng Ouyang, Anand Shah, Wenjia Bai, and Rossella Arcucci. Zero-Shot ECG classification with multimodal learning and test-time clinical knowledge enhancement. arXiv preprint arXiv:2403.06659.

[16] doi:10.1166/jmihi.2018.2442. Minhao Liu, Ailing Zeng, Muxi Chen, Zhijian Xu, Qiuxia Lai, Lingna Ma, and Qiang Xu. SCINet: Time series modeling and forecasting with sample convolution and interaction. Advances in Neural Information Processing Systems, 35:5816–5828.

[17] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. arXiv preprint arXiv:2310.06625.

[18] Riccardo Lunelli, Angus Nicolson, Samuel Martin Pröll, Sebastian Johannes Reinstadler, Axel Bauer, and Clemens Dlaska. BenchECG and xECG: a benchmark and baseline for ECG foundation models. arXiv preprint arXiv:2509.10151.

[19] Kaden McKeen, Sameer Masood, Augustin Toma, Barry Rubin, and Bo Wang. ECG-FM: An open electrocardiogram foundation model. arXiv preprint arXiv:2408.05178.

[20] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730.

[21] doi:10.17044/scilifelab.15169716.v1. Nils Strodthoff, Patrick Wagner, Tobias Schaeffter, and Wojciech Samek. Deep learning for ECG analysis: Benchmarks and insights from PTB-XL. IEEE Journal of Biomedical and Health Informatics, 25(5):1519–1528.

[22] Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Tran, Dani Yogatama, and Donald Metzler. Scaling laws vs model architectures: How does inductive bias influence scaling? In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 12342–12364.

[23] doi:10.1038/s41597-020-0495-6. Huiqiang Wang, Jian Peng, Feihu Huang, Jince Wang, Junhui Chen, and Yifei Xiao. MICN: Multi-scale local and global context modeling for long-term series forecasting. In International Conference on Learning Representations.

[24] Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. TimesNet: Temporal 2D-variation modeling for general time series analysis. arXiv preprint arXiv:2210.02186.

[25] Qingren Yao, Chao-Han Huck Yang, Renhe Jiang, Yuxuan Liang, Ming Jin, and Shirui Pan. Towards neural scaling laws for time series foundation models. arXiv preprint arXiv:2410.12360.

[26] doi:10.13026/wgex-er52. Version 1.0.0. Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11106–11115.

[27] provides annotations for six representative rhythm and conduction abnormalities, including first-degree atrioventricular block (1dAVb, 1.5%), right bundle branch block (RBBB, 2.7%), left bundle branch block (LBBB, 1.7%), sinus bradycardia (SB, 1.6%), atrial fibrillation (AF, 1.8%), and sinus tachycardia (ST, 2.1%). We do not pretrain on the Harvard–Emory E...

[28] suggests that, under comparable training setups, encoder-only and decoder-only Transformers exhibit similar parameter and data scaling behaviors. For the SSL, we use the off-the-shelf HeartLang codebook (Jin et al., 2025), which was pre-trained on MIMIC-IV (0.8M patients), and the random masking rate is 50%. Besides, it should be noted that the dataset ...

[29] 40,180 40,180
Two binary labels are included to indicate intensive care unit (ICU) admission: overall ICU stay and ICU admission within 24 hours.
Table S.2: Dataset-specific data splits, preprocessing procedures, and code availability. All datasets are resampled to 100 Hz to align with the HeartLang codebook.
Dataset name Train / Validation / Test spli...

[30] Evaluation setup. For the evaluation, we apply a linear classifier atop the frozen latent representations. This classifier is optimized via the Adam algorithm, configured with a learning rate of $10^{-3}$ and cross-entropy loss. The linear classifier is trained on the training split of each OOD dataset and validated on the corresponding validation split for a ...
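The linear-probe evaluation described in this excerpt can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the authors' pipeline: the tiny 2-D `FEATURES` list stands in for frozen encoder outputs, the task is treated as single-label classification with a softmax cross-entropy loss, training is full-batch for a fixed 200 epochs, and all function names are hypothetical. Only the learning rate of 1e-3, the Adam optimizer, and the cross-entropy loss come from the text; the remaining Adam constants are the commonly used defaults.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(weights, x):
    """Linear scores; the last entry of each weight row is the bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + row[-1] for row in weights]


def train_linear_probe(features, labels, num_classes, lr=1e-3, epochs=200,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    """Fit a linear classifier on fixed features with Adam and cross-entropy."""
    dim = len(features[0])
    n = len(features)
    weights = [[0.0] * (dim + 1) for _ in range(num_classes)]
    m = [[0.0] * (dim + 1) for _ in range(num_classes)]  # first-moment estimates
    v = [[0.0] * (dim + 1) for _ in range(num_classes)]  # second-moment estimates
    for step in range(1, epochs + 1):
        # Gradient of the mean cross-entropy: (softmax - one_hot) times the input.
        grad = [[0.0] * (dim + 1) for _ in range(num_classes)]
        for x, y in zip(features, labels):
            probs = softmax(logits_for(weights, x))
            for c in range(num_classes):
                err = (probs[c] - (1.0 if c == y else 0.0)) / n
                for j in range(dim):
                    grad[c][j] += err * x[j]
                grad[c][dim] += err
        # Adam update with bias-corrected moment estimates.
        for c in range(num_classes):
            for j in range(dim + 1):
                g = grad[c][j]
                m[c][j] = beta1 * m[c][j] + (1.0 - beta1) * g
                v[c][j] = beta2 * v[c][j] + (1.0 - beta2) * g * g
                m_hat = m[c][j] / (1.0 - beta1 ** step)
                v_hat = v[c][j] / (1.0 - beta2 ** step)
                weights[c][j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return weights


def predict(weights, x):
    """Return the index of the highest-scoring class."""
    scores = logits_for(weights, x)
    return scores.index(max(scores))


# Toy stand-in for frozen latent representations (assumed, linearly separable).
FEATURES = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0),
            (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
LABELS = [0, 0, 0, 1, 1, 1]

probe = train_linear_probe(FEATURES, LABELS, num_classes=2)
predictions = [predict(probe, x) for x in FEATURES]
```

Because the encoder is frozen, only the probe's weights are trained, so the probe's accuracy reflects how linearly separable the pre-trained representations are rather than how well the encoder can be adapted.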
