arxiv: 2603.03331 · v2 · submitted 2026-02-10 · 💻 cs.CL · cs.AI

PulseLM: A Foundation Dataset and Benchmark for PPG-Text Learning

Pith reviewed 2026-05-16 02:13 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: photoplethysmography, PPG, question answering, multimodal learning, physiological monitoring, dataset, language models, biosignals

The pith

PulseLM reformats over a million PPG segments from sixteen sources into nearly 2.5 million natural-language question-answer pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PulseLM as a large dataset that connects raw photoplethysmography waveforms directly to text by turning existing numerical annotations into question-answer pairs. It pools recordings from sixteen public sources, standardizes them into more than one million 10-second segments, and produces almost 2.5 million QA pairs spread across twelve tasks. This QA structure lets multimodal large language models perform language-based inference on physiological signals instead of working only with numbers. The authors supply the data, pipelines, training recipes, and evaluation protocols so that different teams can run comparable experiments. A reader would care because the format makes it feasible to build intuitive text interfaces for continuous health monitoring that today depend on separate numerical models.

Core claim

PulseLM aggregates PPG recordings from sixteen publicly available sources and harmonizes heterogeneous annotations into 12 downstream tasks. The resulting dataset contains over 1 million standardized 10-second PPG segments paired with nearly 2.5 million question-answer pairs. The authors define reproducible data pipelines, training procedures, and evaluation protocols, then establish baseline benchmarks with multimodal PPG-aware large language models. This supplies a standardized foundation for language-grounded physiological inference, cross-dataset generalization, and scalable benchmarking of PPG-based multimodal models.

What carries the argument

The unified question-answering formulation that converts heterogeneous PPG numerical labels and measurements into natural-language question-answer pairs across twelve tasks.
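
As a rough illustration of what such a conversion could look like, the sketch below turns one numeric label into one question-answer pair. Everything in it is an assumption made for illustration: the `QAPair` fields, the choice of heart rate as the task, the question wording, and the category bins (conventional adult resting thresholds) are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    """Hypothetical record layout; the released dataset may use other fields."""
    segment_id: str
    task: str
    question: str
    answer: str


# Illustrative bins only: (inclusive lower bound, exclusive upper bound, name).
HR_BINS = [
    (0.0, 60.0, "bradycardia"),
    (60.0, 100.0, "normal"),
    (100.0, float("inf"), "tachycardia"),
]


def heart_rate_to_qa(segment_id: str, bpm: float) -> QAPair:
    """Map a numeric heart-rate label to a templated question-answer pair."""
    for low, high, name in HR_BINS:
        if low <= bpm < high:
            return QAPair(
                segment_id=segment_id,
                task="heart_rate",
                question="What is the heart rate category of this PPG segment?",
                answer=f"The heart rate is about {bpm:.0f} bpm, "
                       f"which falls in the {name} range.",
            )
    raise ValueError(f"heart rate out of range: {bpm}")
```

A deterministic template like this keeps the numeric value recoverable from the answer text, which is what makes the fidelity question raised later in the report checkable.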

If this is right

  • Multimodal models can be trained end-to-end to answer natural-language questions about PPG waveforms.
  • The single dataset format supports direct measurement of how well models generalize across different PPG collection devices and settings.
  • Reproducible training and evaluation protocols allow consistent comparison of new PPG-text methods against the provided baselines.
  • The 2.5 million QA pairs supply sufficient scale for fine-tuning or instruction-tuning large language models on physiological signals.
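
One common way to measure the cross-source generalization mentioned above is a leave-one-source-out split. The sketch below assumes each record is a dictionary carrying a `source` key naming its origin dataset; that record layout is hypothetical and the paper's own evaluation protocol may split the data differently.

```python
from collections import defaultdict


def leave_one_source_out(records):
    """Yield (held_out_source, train, test) once per source dataset.

    `records` is any iterable of dicts with a "source" key (an assumed layout).
    """
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)

    for held_out in sorted(by_source):
        test = by_source[held_out]
        # Train on every record that does not come from the held-out source.
        train = [
            rec
            for source, recs in by_source.items()
            if source != held_out
            for rec in recs
        ]
        yield held_out, train, test
```

Because no segment from the held-out source appears in training, a drop in test accuracy relative to a random split gives a direct read on device and cohort shift.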

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Voice assistants or chat interfaces could eventually query wearable devices about real-time cardiovascular state using the same QA format.
  • The harmonized data may surface signal features that remain stable across clinical, lab, and consumer-grade PPG sensors.
  • Extending the QA pairs to include forward-looking questions could support predictive tasks such as estimating future blood-pressure trends from current waveforms.

Load-bearing premise

Reformatting numerical PPG labels from many different sources into a single question-answer format preserves enough clinical meaning for language models to perform accurate physiological inference.

What would settle it

If a multimodal model trained on PulseLM achieves no higher accuracy on the original numerical tasks than task-specific models when both are tested on held-out segments from the source datasets, the QA conversion would have lost critical information.

Figures

Figures reproduced from arXiv: 2603.03331 by Aaqib Saeed, Bin Zhu, Dong Ma, Hung Manh Pham, Jinyang Wu, Xiao Ma, Yiming Zhang, Yixin Xu, Zhou Pan.

Figure 1: Overview of our dataset study.
Figure 2: Illustration of benchmarking PPG language modeling.
Figure 3: Demonstration of label distributions in PulseLM dataset.

Abstract

Photoplethysmography (PPG) is a widely used non-invasive sensing modality for continuous cardiovascular and physiological monitoring across clinical, laboratory, and wearable settings. While existing PPG datasets support a broad range of downstream tasks, they typically provide supervision in the form of numerical measurements or task-specific labels, limiting their compatibility with language-based interfaces and multimodal foundation models. In this work, we introduce PulseLM, a large-scale PPG-text question-answering dataset that bridges raw PPG waveforms and natural language through a unified question-answering (QA) formulation. PulseLM aggregates PPG recordings from sixteen publicly available sources and harmonizes heterogeneous annotations into 12 downstream tasks. The dataset comprises over 1 million standardized 10-second PPG segments, associated with nearly 2.5 million question-answer pairs. We further define reproducible data pipeline, training, and evaluation protocols and establish baseline benchmarks using multimodal PPG-aware large language models. PulseLM provides a standardized foundation for studying language-grounded physiological inference, cross-dataset generalization, and scalable benchmarking of PPG-based multimodal models. We publicly release the dataset and code at https://huggingface.co/datasets/Manhph2211/PulseLM and https://github.com/manhph2211/PULSE-LM, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PulseLM, a large-scale PPG-text QA dataset that aggregates recordings from sixteen public sources into over 1 million standardized 10-second segments paired with nearly 2.5 million question-answer pairs spanning 12 downstream tasks. It defines reproducible data pipelines, training protocols, and evaluation benchmarks using multimodal PPG-aware LLMs, with public release of the dataset and code.

Significance. If the harmonization of heterogeneous annotations into QA format is shown to preserve clinical fidelity, PulseLM would provide a valuable standardized foundation for language-grounded physiological inference, cross-dataset generalization, and benchmarking of multimodal models. The public release of data and code plus the emphasis on reproducible pipelines are concrete strengths that would facilitate community adoption.

major comments (2)
  1. [Methods (harmonization pipeline)] The harmonization process that converts numerical labels from heterogeneous sources (varying devices, sampling rates, and cohorts) into unified natural-language QA pairs lacks any reported quantitative checks on label fidelity, inter-source consistency, or expert validation of the generated pairs; this is load-bearing for the claim that the 2.5M pairs support reliable language-grounded inference.
  2. [Experiments and baselines] Baseline results for the 12 tasks are presented without ablation studies isolating the effect of harmonization choices or metrics quantifying noise introduced by label conversion; without these, it is unclear whether downstream performance reflects true physiological signal or source-specific artifacts.
minor comments (2)
  1. [Abstract] The abstract states that the dataset 'bridges raw PPG waveforms and natural language' but the precise mapping from 10-second segments to QA pairs should be illustrated with concrete examples in the main text.
  2. [Dataset description] Table or figure captions describing the 12 tasks should explicitly list the original source labels that were mapped to each task to improve traceability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper accordingly to strengthen the validation of the harmonization pipeline and the experimental analyses.

  1. Referee: The harmonization process that converts numerical labels from heterogeneous sources (varying devices, sampling rates, and cohorts) into unified natural-language QA pairs lacks any reported quantitative checks on label fidelity, inter-source consistency, or expert validation of the generated pairs; this is load-bearing for the claim that the 2.5M pairs support reliable language-grounded inference.

    Authors: We acknowledge that the current manuscript does not report quantitative validation of the harmonization process. Section 3 details the deterministic rule-based mappings from source annotations to QA pairs, but we agree these lack explicit fidelity checks. In the revision, we will add: (i) inter-source consistency metrics computed on overlapping cohorts (e.g., agreement rates between original labels and QA-derived values), (ii) fidelity scores comparing numerical ground truth to QA interpretations on a 10k-segment held-out set, and (iii) results from expert clinician review of a 500-pair random sample assessing clinical accuracy and natural language quality. These additions will directly support the reliability of the 2.5M pairs. revision: yes

  2. Referee: Baseline results for the 12 tasks are presented without ablation studies isolating the effect of harmonization choices or metrics quantifying noise introduced by label conversion; without these, it is unclear whether downstream performance reflects true physiological signal or source-specific artifacts.

    Authors: We agree that the absence of targeted ablations limits interpretability of the baseline results. In the revised version, we will incorporate: (i) ablation experiments comparing multimodal LLM performance on harmonized QA pairs versus direct numerical supervision (where source labels permit), and (ii) noise quantification metrics including cross-source performance variance and label-perturbation sensitivity analysis. These will isolate the impact of harmonization choices and demonstrate that reported performance primarily reflects physiological signal rather than conversion artifacts. revision: yes
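
Two of the checks promised in the rebuttal can be sketched in a few lines. The first is a round-trip agreement rate between original numeric labels and the values recoverable from answer text; the second is the variance of accuracy across source datasets. The answer format ("... 72 bpm ..."), the tolerance, and the function names are assumptions for illustration, not the authors' metrics.

```python
import re
from collections import defaultdict
from statistics import pvariance


def parse_bpm(answer):
    """Return the first number followed by 'bpm' in the answer, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*bpm", answer)
    return float(match.group(1)) if match else None


def agreement_rate(labels, answers, tolerance=1.0):
    """Fraction of answers whose parsed value is within `tolerance` of the label."""
    if len(labels) != len(answers) or not labels:
        raise ValueError("labels and answers must be non-empty and equal length")
    hits = 0
    for label, answer in zip(labels, answers):
        value = parse_bpm(answer)
        # An answer with no parsable value counts as a disagreement.
        if value is not None and abs(value - label) <= tolerance:
            hits += 1
    return hits / len(labels)


def per_source_accuracy(results):
    """`results` is an iterable of (source_name, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # source -> [correct, total]
    for source, correct in results:
        totals[source][0] += int(correct)
        totals[source][1] += 1
    return {source: c / n for source, (c, n) in totals.items()}


def cross_source_variance(results):
    """Population variance of per-source accuracy; 0 means uniform performance."""
    return pvariance(list(per_source_accuracy(results).values()))
```

A low agreement rate would point at the conversion templates, while a high cross-source variance would point at source-specific artifacts, so the two numbers separate the referee's two concerns.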

Circularity Check

0 steps flagged

No significant circularity in dataset aggregation and release

The paper's central contribution is the construction and public release of PulseLM, formed by aggregating 16 existing public PPG sources and converting their heterogeneous numerical annotations into a unified QA format across 12 tasks. No derivations, equations, fitted parameters, or model predictions are present that could reduce to inputs by construction. The work contains no self-citation chains, uniqueness theorems, or ansatzes that bear load on the claims; the harmonization process is described as a reproducible pipeline without invoking prior author results as external justification. This is a standard data-release paper whose validity rests on the transparency of the aggregation steps rather than any self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No mathematical derivations or fitted parameters; the work rests on the domain assumption that public PPG sources can be harmonized into QA without loss of utility.

axioms (1)
  • Domain assumption: Heterogeneous PPG annotations from 16 sources can be reliably mapped to 12 unified downstream tasks via QA formulation.
    Invoked in the abstract when describing harmonization of annotations.

pith-pipeline@v0.9.0 · 6677 in / 1139 out tokens · 161126 ms · 2026-05-16T02:13:53.627973+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    PulseLM aggregates PPG recordings from fifteen publicly available sources and harmonizes heterogeneous annotations into twelve common physiologically QA tasks... all PPG recordings are standardized through a unified preprocessing pipeline comprising four stages: Resampling... Filtering... Segmentation... Normalization.
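
    The four stages named in the quoted passage can be sketched in plain Python as follows. This is a minimal illustration, not the paper's pipeline: the 125 Hz target rate, the linear-interpolation resampler, and the moving-average baseline removal (a crude stand-in for whatever filter the authors use) are all assumptions. Only the 10-second window length comes from the paper.

```python
from statistics import mean, pstdev


def resample_linear(signal, src_hz, dst_hz):
    """Stage 1: resample to dst_hz by linear interpolation."""
    n_out = int(len(signal) * dst_hz / src_hz)
    out = []
    for i in range(n_out):
        pos = i * src_hz / dst_hz          # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out


def remove_baseline(signal, window):
    """Stage 2 (stand-in): subtract a centered moving average to cut drift."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - half):min(len(signal), i + half + 1)]
        out.append(signal[i] - sum(chunk) / len(chunk))
    return out


def segment(signal, hz, seconds=10):
    """Stage 3: split into non-overlapping windows, dropping any remainder."""
    size = int(hz * seconds)
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def zscore(window):
    """Stage 4: normalize one window to zero mean and unit variance."""
    mu, sigma = mean(window), pstdev(window)
    if sigma == 0:
        return [0.0] * len(window)
    return [(x - mu) / sigma for x in window]


def preprocess(signal, src_hz, dst_hz=125, seconds=10):
    """Run all four stages; dst_hz=125 is an assumed target rate."""
    resampled = resample_linear(signal, src_hz, dst_hz)
    filtered = remove_baseline(resampled, window=int(dst_hz))
    return [zscore(w) for w in segment(filtered, dst_hz, seconds)]
```

    Normalizing per window rather than per recording keeps each 10-second segment self-contained, at the cost of discarding absolute amplitude information.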

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
