
arxiv: 2606.07365 · v1 · pith:4JPGVCJYnew · submitted 2026-06-05 · 💻 cs.LG · cs.AI

A robust PPG foundation model using multimodal physiological supervision

Pith reviewed 2026-06-27 22:19 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords PPG · foundation model · multimodal supervision · contrastive learning · photoplethysmography · ECG · generalization · wearables

The pith

Using ECG and respiratory signals to pick contrastive PPG samples yields a foundation model that outperforms prior models on 14 of 15 tasks after pretraining on one-third as many subjects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes pretraining a PPG foundation model on ICU data by using accompanying ECG and respiratory signals to select contrastive samples. This lets the model keep and learn from noisy PPG segments instead of discarding them or requiring curated high-quality data. Pretrained on three times fewer subjects than prior state-of-the-art approaches, the resulting model improves performance on fourteen of fifteen downstream tasks that include field-like daily activity recognition and heart rate prediction. The results indicate that multimodal physiological supervision can integrate complementary signals to produce PPG representations that generalize more robustly to consumer-grade, noisy data.

Core claim

By supervising contrastive sample selection with ECG and respiratory channels from ICU recordings, a PPG model can retain noisy segments during pretraining and thereby learn representations that generalize to noisy field PPG while requiring fewer subjects than existing foundation models.

What carries the argument

Multimodal physiological supervision that uses ECG and respiratory signals to select contrastive samples during pretraining on ICU PPG data.
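
As a rough illustration of what "using ECG and respiratory signals to select contrastive samples" could look like, the sketch below pairs each PPG window with the window whose physiological summary vector is closest. It assumes each window already has a 5-dimensional metric vector (as described in the Figure 1 caption). The function name `select_positives`, the z-scoring, and the nearest-neighbour rule are illustrative assumptions, not the paper's stated procedure.

```python
import numpy as np

def select_positives(metrics: np.ndarray) -> np.ndarray:
    """For each window, return the index of the other window whose
    ECG/RESP summary vector is nearest (hypothetical selection rule).

    metrics: array of shape (n_windows, n_metrics), one row per 10 s window.
    """
    # Standardise each metric so no single one dominates the distance.
    mu = metrics.mean(axis=0)
    sigma = metrics.std(axis=0)
    z = (metrics - mu) / np.where(sigma > 0, sigma, 1.0)

    # Pairwise Euclidean distances between summary vectors.
    diff = z[:, None, :] - z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # A window must not be selected as its own positive.
    np.fill_diagonal(dist, np.inf)
    return dist.argmin(axis=1)

# Toy data (made up): windows 0 and 1 are physiologically similar, as are 2 and 3.
toy = np.array([
    [60.0, 0.05, 12.0, 0.9, 0.1],
    [61.0, 0.05, 12.5, 0.9, 0.1],
    [95.0, 0.02, 20.0, 0.4, 0.6],
    [96.0, 0.02, 19.5, 0.4, 0.6],
])
positives = select_positives(toy)
```

Because a rule of this kind never inspects the PPG waveform itself, noisy PPG windows stay in the training pool, which is the property the summary above attributes to the method.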

If this is right

  • The model can be deployed on consumer wearables without access to large curated field PPG corpora for pretraining.
  • Robustness at inference improves because noisy PPG segments are retained rather than filtered during training.
  • Downstream performance gains appear on both clinical and daily-activity tasks that use real-world noisy signals.
  • Pretraining data requirements drop because three times fewer subjects suffice to reach or exceed prior results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same supervision strategy could be tested on other biosignals where auxiliary channels exist only during training but not at deployment.
  • If the auxiliary signals reduce selection bias, similar multimodal selection might improve single-channel foundation models in other sensing domains.
  • Lower subject counts could make large-scale pretraining feasible for groups with limited access to ICU datasets.

Load-bearing premise

Selecting contrastive samples with ECG and respiratory signals from ICU datasets will produce representations that generalize to noisy field-like PPG without selection bias or loss of information unique to the PPG channel.

What would settle it

Evaluation on a held-out set of noisy field PPG recordings in which the model shows no improvement over single-channel baselines or degrades on tasks that depend on subtle PPG features absent from the auxiliary signals.

Figures

Figures reproduced from arXiv: 2606.07365 by Daniel P. Darcy, Eloy Geenjaar, Gouthaman KV, Lie Lu, Scott Daly, Trisha Mittal, Vince Calhoun.

Figure 1
Multimodal contrastive supervision framework. (Left) The electrocardiogram (ECG) and respiratory (RESP) data co-recorded with PPG is segmented into 10s windows. Five metrics are extracted from the ECG and RESP segments that summarize those windows in a 5-dimensional vector. (Middle) The metrics are used to generate contrastive samples during pretraining. (Right) The unimodal PPG embeddings are evaluated us…
Figure 2
Comparison with state-of-the-art. (Left) Classification results in terms of their macro F-1 score (larger area is better). (Right) Regression results in terms of their mean average error (smaller area is better). We evaluate across subject linear probing, and within subject linear probing
Figure 4
WildPPG heart rate estimation comparison across PPG sensor location (x-axis) and type (y-axis).
… aged 47-61 is worst and performance is lower for female subjects. These findings underscore existing challenges in equitable biosignal modeling and highlight areas for future bias mitigation.
Figure 5
Figure 5 shows a visualization of the embedding space for a single subject from the PPG-DaLiA dataset, highlighting the difference between our method and our replication of PaPaGei-S (labeled as 'Unimodal'). In the figure, our model's embedding space shows a clear gradient in terms of heart rate, whereas the other models do not. Moreover, since data availability for a new user may be sparse, in…
Figure 6
Average performance across varying percentages of within-subject data. Shaded areas represent standard deviation across folds.
Figure 7
…), middle-to-high (0.575–0.75) is defined to capture the bulk of the IMU energy distribution, and ultra-high IMU energy values (> 0.75) are ones that fall into the long tail of the distribution. These thresholds roughly correspond to distinct operating conditions ranging from relatively stable acquisition to severe motion artifacts. Heart rate regression results in each of these regimes are reported in …
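
The motion regimes described in the Figure 7 excerpt amount to binning a per-window IMU energy value. A minimal sketch follows. Only the 0.575 and 0.75 cut points come from the excerpt; the function name, the label for values below 0.575 (whose definition is cut off in the excerpt), and the handling of values exactly on a boundary are assumptions.

```python
import numpy as np

# Cut points quoted in the excerpt; everything else here is illustrative.
EDGES = np.array([0.575, 0.75])
LABELS = np.array(["below-0.575", "middle-to-high", "ultra-high"])

def imu_regime(energy):
    """Map IMU energy values to regime labels.

    Assumed boundary convention: values <= 0.575 fall in the lowest bin,
    values in (0.575, 0.75] are middle-to-high, and values > 0.75 are
    ultra-high (the excerpt states the last rule explicitly).
    """
    energy = np.asarray(energy, dtype=float)
    # right=True: bin i satisfies EDGES[i-1] < x <= EDGES[i].
    return LABELS[np.digitize(energy, EDGES, right=True)]

# Made-up energy values for illustration.
regimes = imu_regime([0.10, 0.60, 0.75, 0.90])
```

Heart rate error could then be grouped by these labels to produce per-regime results of the kind the excerpt refers to.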
Abstract

Photoplethysmography (PPG), a non-invasive measure of changes in blood volume, is widely used in both wearable devices and clinical settings. Recent PPG foundation models either use open-source ICU datasets with pretraining paradigms that require curated data and thus complicate generalization to field-like data, or use closed-source field-like PPG data. In contrast, we propose a PPG foundation model that does not require high-quality or field-like pretraining data, and instead leverages accompanying electrocardiogram and respiratory signals in ICU datasets to select contrastive samples during pretraining. Our approach allows the model to retain and learn from noisy PPG segments, improving robustness at inference. Our model, pretrained on 3x fewer subjects than existing state-of-the-art approaches, achieves performance improvements on 14 out of 15 diverse downstream tasks, including field-like daily activity and heart rate prediction. Our results demonstrate that multimodal supervision can integrate complementary physiological information to improve the robustness of PPG foundation models and enhance their generalization to consumer-grade data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a PPG foundation model pretrained on ICU datasets by using accompanying ECG and respiratory signals to select contrastive samples. This multimodal supervision is intended to allow the model to retain and learn from noisy PPG segments, yielding a representation robust to field-like artifacts. The model is pretrained on 3× fewer subjects than prior SOTA approaches and is reported to improve performance on 14 of 15 downstream tasks, including daily activity recognition and heart-rate estimation on consumer-grade data.

Significance. If the central empirical claim holds after addressing selection-bias concerns, the result would be significant: it would show that abundant, lower-quality ICU recordings can be leveraged for robust PPG pretraining without curated field data, lowering the barrier to foundation-model development for wearable PPG applications.

major comments (2)
  1. [§3] §3 (multimodal sample selection): the procedure that retains PPG segments whose ECG/resp agreement exceeds a threshold must be shown not to embed ICU-specific cross-channel correlations that are unavailable at inference on field PPG; without an explicit test (e.g., performance drop when the same selection rule is applied to ambulatory data or an ablation that removes ECG/resp guidance), the 14/15-task gains could be explained by retained ICU cues rather than improved robustness.
  2. [Results] Results section (downstream evaluation): the headline claim of improvement on 14/15 tasks is presented without reported statistical tests, confidence intervals, or per-task effect sizes; in addition, an ablation that trains the identical architecture with random (non-multimodal) contrastive sampling is required to isolate whether the reported gains are attributable to the proposed supervision rather than architecture or data volume.
minor comments (2)
  1. [Abstract] Abstract and §1: quantitative deltas, baseline names, and subject counts for the 3× reduction claim should be stated explicitly rather than left as qualitative assertions.
  2. [§3] Notation in §3: the precise definition of the contrastive loss and the ECG/resp agreement metric (e.g., cross-correlation threshold) should be given as an equation rather than prose description.
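
For orientation, a generic InfoNCE-style objective is one standard way a contrastive loss of this kind is written. The symbols below (PPG embeddings $z_i$, the index $p(i)$ of the positive selected for window $i$, temperature $\tau$, and cosine similarity $\mathrm{sim}$) are assumptions chosen for illustration; the paper's actual loss and agreement metric are not reproduced on this page.

```latex
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp\!\big(\mathrm{sim}(z_i, z_{p(i)}) / \tau\big)}
            {\sum_{j \neq i} \exp\!\big(\mathrm{sim}(z_i, z_j) / \tau\big)}
```

Under this reading, the ECG and respiratory signals would enter only through the choice of $p(i)$, which is why an explicit definition of the selection rule matters for reproducibility.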

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate revisions to the manuscript.

point-by-point responses
  1. Referee: §3 (multimodal sample selection): the procedure that retains PPG segments whose ECG/resp agreement exceeds a threshold must be shown not to embed ICU-specific cross-channel correlations that are unavailable at inference on field PPG; without an explicit test (e.g., performance drop when the same selection rule is applied to ambulatory data or an ablation that removes ECG/resp guidance), the 14/15-task gains could be explained by retained ICU cues rather than improved robustness.

    Authors: We agree that it is important to confirm the gains arise from robustness rather than retained ICU correlations. The selection procedure is applied exclusively during pretraining; inference uses PPG alone. We will add an ablation that trains the identical architecture with random (non-multimodal) contrastive sampling on the same data to isolate the contribution of ECG/resp guidance. We note that directly applying the selection rule to ambulatory data is not possible, as those datasets lack the auxiliary ECG and respiratory channels required to compute agreement. revision: partial

  2. Referee: Results section (downstream evaluation): the headline claim of improvement on 14/15 tasks is presented without reported statistical tests, confidence intervals, or per-task effect sizes; in addition, an ablation that trains the identical architecture with random (non-multimodal) contrastive sampling is required to isolate whether the reported gains are attributable to the proposed supervision rather than architecture or data volume.

    Authors: We accept that statistical reporting and the requested ablation are needed for rigor. The revised manuscript will include per-task statistical tests, confidence intervals, and effect sizes for all 15 downstream tasks. We will also report the ablation using random contrastive sampling with the same architecture and data to attribute gains specifically to the multimodal supervision. revision: yes
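
The per-task intervals promised in this response could be computed with a paired percentile bootstrap over evaluation folds or subjects. The sketch below is a generic illustration; the function name `paired_bootstrap_ci`, the resampling unit, and the toy scores are hypothetical and not taken from the paper.

```python
import numpy as np

def paired_bootstrap_ci(ours, baseline, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (ours - baseline).

    ours, baseline: per-fold (or per-subject) scores on the same splits.
    Returns (mean_difference, lower_bound, upper_bound).
    """
    ours = np.asarray(ours, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    if ours.shape != baseline.shape:
        raise ValueError("scores must be paired on identical splits")

    diffs = ours - baseline
    rng = np.random.default_rng(seed)
    # Resample the paired differences with replacement, n_boot times.
    idx = rng.integers(0, diffs.size, size=(n_boot, diffs.size))
    boot_means = diffs[idx].mean(axis=1)

    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), lower, upper

# Made-up macro-F1 scores on five folds, for illustration only.
mean_diff, lo, hi = paired_bootstrap_ci(
    ours=[0.71, 0.69, 0.74, 0.70, 0.72],
    baseline=[0.66, 0.68, 0.70, 0.67, 0.69],
)
```

An interval that excludes zero would support a per-task improvement claim. With 15 tasks, a multiple-comparison correction would also be worth considering.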

standing simulated objections not resolved
  • Direct application of the multimodal selection rule to ambulatory data remains untested, because such datasets lack the ECG and respiratory signals required to compute the agreement threshold.

Circularity Check

0 steps flagged

No circularity; purely empirical pretraining with held-out evaluation


The paper presents an empirical contrastive pretraining method that selects samples using accompanying ECG/respiratory channels from ICU data, then evaluates the resulting encoder on 15 downstream tasks (including field-like ones) using held-out data. No equations, derivations, uniqueness theorems, or fitted parameters are described that reduce a claimed prediction to the method's own inputs by construction. The central performance claims (14/15 improvements, 3x fewer subjects) are external benchmarks, not self-referential. This matches the default non-circular case for empirical ML work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities. The method appears to rest on standard contrastive learning assumptions common in self-supervised ML without additional ad-hoc constructs described.

pith-pipeline@v0.9.1-grok · 5715 in / 1125 out tokens · 21319 ms · 2026-06-27T22:19:03.650150+00:00 · methodology

