PulseLM: A Foundation Dataset and Benchmark for PPG-Text Learning
Pith reviewed 2026-05-16 02:13 UTC · model grok-4.3
The pith
PulseLM pairs over a million 10-second PPG segments from sixteen public sources with nearly 2.5 million natural-language question-answer pairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PulseLM aggregates PPG recordings from sixteen publicly available sources and harmonizes heterogeneous annotations into 12 downstream tasks. The resulting dataset contains over 1 million standardized 10-second PPG segments paired with nearly 2.5 million question-answer pairs. The authors define reproducible data pipelines, training procedures, and evaluation protocols, then establish baseline benchmarks with multimodal PPG-aware large language models. This supplies a standardized foundation for language-grounded physiological inference, cross-dataset generalization, and scalable benchmarking of PPG-based multimodal models.
What carries the argument
The unified question-answering formulation that converts heterogeneous PPG numerical labels and measurements into natural-language question-answer pairs across twelve tasks.
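To make the formulation concrete, the sketch below turns one numerical label into a question-answer pair. It is a minimal illustration, not the authors' pipeline: the task name, the question wording, the bin edges, and the names `QAPair`, `HR_BINS`, and `heart_rate_to_qa` are all assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    """One question-answer pair attached to a PPG segment (hypothetical schema)."""
    segment_id: str
    task: str
    question: str
    answer: str


# Illustrative bin edges in beats per minute; assumed, not taken from the paper.
HR_BINS = [
    (0.0, 60.0, "below the typical resting range"),
    (60.0, 100.0, "within the typical resting range"),
    (100.0, float("inf"), "above the typical resting range"),
]


def heart_rate_to_qa(segment_id: str, bpm: float) -> QAPair:
    """Map a numerical heart-rate label to a natural-language QA pair."""
    for low, high, label in HR_BINS:
        if low <= bpm < high:
            return QAPair(
                segment_id=segment_id,
                task="heart_rate_category",  # hypothetical task name
                question="Which heart-rate range does this PPG segment fall into?",
                answer=label,
            )
    # Negative or NaN values match no bin and are rejected.
    raise ValueError(f"heart rate out of range: {bpm!r}")
```

A deterministic mapping of this kind is easy to audit, but it also discards resolution: every value inside a bin receives the same answer, which is the information loss the load-bearing premise below is concerned with.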
If this is right
- Multimodal models can be trained end-to-end to answer natural-language questions about PPG waveforms.
- The single dataset format supports direct measurement of how well models generalize across different PPG collection devices and settings.
- Reproducible training and evaluation protocols allow consistent comparison of new PPG-text methods against the provided baselines.
- The 2.5 million QA pairs supply sufficient scale for fine-tuning or instruction-tuning large language models on physiological signals.
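One common way to measure the cross-source generalization mentioned in the list above is a leave-one-source-out split, sketched here. This is an assumed protocol for illustration, not necessarily the evaluation the paper uses; the record layout (a dict with a `source` key) and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def leave_one_source_out(
    records: List[dict],
) -> Iterator[Tuple[str, List[dict], List[dict]]]:
    """Yield (held_out_source, train_records, test_records) for each source."""
    by_source: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        by_source[record["source"]].append(record)

    # Sorted so the fold order is reproducible across runs.
    for held_out in sorted(by_source):
        train = [
            record
            for source, group in by_source.items()
            if source != held_out
            for record in group
        ]
        yield held_out, train, by_source[held_out]
```

Each fold trains on every source except one and tests on the held-out source, so a drop in accuracy relative to a within-source split points at device- or cohort-specific effects.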
Where Pith is reading between the lines
- Voice assistants or chat interfaces could eventually query wearable devices about real-time cardiovascular state using the same QA format.
- The harmonized data may surface signal features that remain stable across clinical, lab, and consumer-grade PPG sensors.
- Extending the QA pairs to include forward-looking questions could support predictive tasks such as estimating future blood-pressure trends from current waveforms.
Load-bearing premise
Reformatting numerical PPG labels from many different sources into a single question-answer format preserves enough clinical meaning for language models to perform accurate physiological inference.
What would settle it
Train a multimodal model on PulseLM and compare it with task-specific models on the original numerical tasks, using held-out segments from the source datasets. If the PulseLM-trained model is no more accurate than the task-specific models, that would indicate the QA conversion lost critical information.
Original abstract
Photoplethysmography (PPG) is a widely used non-invasive sensing modality for continuous cardiovascular and physiological monitoring across clinical, laboratory, and wearable settings. While existing PPG datasets support a broad range of downstream tasks, they typically provide supervision in the form of numerical measurements or task-specific labels, limiting their compatibility with language-based interfaces and multimodal foundation models. In this work, we introduce PulseLM, a large-scale PPG-text question-answering dataset that bridges raw PPG waveforms and natural language through a unified question-answering (QA) formulation. PulseLM aggregates PPG recordings from sixteen publicly available sources and harmonizes heterogeneous annotations into 12 downstream tasks. The dataset comprises over 1 million standardized 10-second PPG segments, associated with nearly 2.5 million question-answer pairs. We further define reproducible data pipeline, training, and evaluation protocols and establish baseline benchmarks using multimodal PPG-aware large language models. PulseLM provides a standardized foundation for studying language-grounded physiological inference, cross-dataset generalization, and scalable benchmarking of PPG-based multimodal models. We publicly release the dataset and code at https://huggingface.co/datasets/Manhph2211/PulseLM and https://github.com/manhph2211/PULSE-LM, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PulseLM, a large-scale PPG-text QA dataset that aggregates recordings from sixteen public sources into over 1 million standardized 10-second segments paired with nearly 2.5 million question-answer pairs spanning 12 downstream tasks. It defines reproducible data pipelines, training protocols, and evaluation benchmarks using multimodal PPG-aware LLMs, with public release of the dataset and code.
Significance. If the harmonization of heterogeneous annotations into QA format is shown to preserve clinical fidelity, PulseLM would provide a valuable standardized foundation for language-grounded physiological inference, cross-dataset generalization, and benchmarking of multimodal models. The public release of data and code plus the emphasis on reproducible pipelines are concrete strengths that would facilitate community adoption.
major comments (2)
- [Methods (harmonization pipeline)] The harmonization process that converts numerical labels from heterogeneous sources (varying devices, sampling rates, and cohorts) into unified natural-language QA pairs lacks any reported quantitative checks on label fidelity, inter-source consistency, or expert validation of the generated pairs; this is load-bearing for the claim that the 2.5M pairs support reliable language-grounded inference.
- [Experiments and baselines] Baseline results for the 12 tasks are presented without ablation studies isolating the effect of harmonization choices or metrics quantifying noise introduced by label conversion; without these, it is unclear whether downstream performance reflects true physiological signal or source-specific artifacts.
minor comments (2)
- [Abstract] The abstract states that the dataset 'bridges raw PPG waveforms and natural language' but the precise mapping from 10-second segments to QA pairs should be illustrated with concrete examples in the main text.
- [Dataset description] Table or figure captions describing the 12 tasks should explicitly list the original source labels that were mapped to each task to improve traceability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper accordingly to strengthen the validation of the harmonization pipeline and the experimental analyses.
- Referee: The harmonization process that converts numerical labels from heterogeneous sources (varying devices, sampling rates, and cohorts) into unified natural-language QA pairs lacks any reported quantitative checks on label fidelity, inter-source consistency, or expert validation of the generated pairs; this is load-bearing for the claim that the 2.5M pairs support reliable language-grounded inference.
  Authors: We acknowledge that the current manuscript does not report quantitative validation of the harmonization process. Section 3 details the deterministic rule-based mappings from source annotations to QA pairs, but we agree these lack explicit fidelity checks. In the revision, we will add: (i) inter-source consistency metrics computed on overlapping cohorts (e.g., agreement rates between original labels and QA-derived values), (ii) fidelity scores comparing numerical ground truth to QA interpretations on a 10k-segment held-out set, and (iii) results from expert clinician review of a 500-pair random sample assessing clinical accuracy and natural language quality. These additions will directly support the reliability of the 2.5M pairs.
  Revision: yes
- Referee: Baseline results for the 12 tasks are presented without ablation studies isolating the effect of harmonization choices or metrics quantifying noise introduced by label conversion; without these, it is unclear whether downstream performance reflects true physiological signal or source-specific artifacts.
  Authors: We agree that the absence of targeted ablations limits interpretability of the baseline results. In the revised version, we will incorporate: (i) ablation experiments comparing multimodal LLM performance on harmonized QA pairs versus direct numerical supervision (where source labels permit), and (ii) noise quantification metrics including cross-source performance variance and label-perturbation sensitivity analysis. These will isolate the impact of harmonization choices and demonstrate that reported performance primarily reflects physiological signal rather than conversion artifacts.
  Revision: yes
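The "agreement rates between original labels and QA-derived values" proposed in the rebuttal could be computed along the following lines. This is a hedged sketch only: the answer format, the number-parsing rule, the tolerance, and the function names are assumptions, and the rebuttal does not specify how the authors would implement the check.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches the first integer or decimal number in a string.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_answer_value(answer: str) -> Optional[float]:
    """Return the first number found in a QA answer, or None if there is none."""
    match = _NUMBER.search(answer)
    return float(match.group()) if match else None


def agreement_rate(
    pairs: Iterable[Tuple[float, str]], tolerance: float = 0.5
) -> float:
    """Fraction of (original_label, answer_text) pairs that agree within tolerance.

    Answers with no parseable number count as disagreements.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no pairs to evaluate")
    hits = 0
    for original, answer in pairs:
        value = parse_answer_value(answer)
        if value is not None and abs(value - original) <= tolerance:
            hits += 1
    return hits / len(pairs)
```

Because the parser takes the first number in the answer, it only suits answer templates where the measured value comes before any other digits; a real check would need a parser matched to each task's template.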
Circularity Check
No significant circularity in dataset aggregation and release
The paper's central contribution is the construction and public release of PulseLM, formed by aggregating 16 existing public PPG sources and converting their heterogeneous numerical annotations into a unified QA format across 12 tasks. No derivations, equations, fitted parameters, or model predictions are present that could reduce to inputs by construction. The work contains no self-citation chains, uniqueness theorems, or ansatzes that bear load on the claims; the harmonization process is described as a reproducible pipeline without invoking prior author results as external justification. This is a standard data-release paper whose validity rests on the transparency of the aggregation steps rather than any self-referential logic.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Heterogeneous PPG annotations from 16 sources can be reliably mapped to 12 unified downstream tasks via QA formulation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Passage: PulseLM aggregates PPG recordings from fifteen publicly available sources and harmonizes heterogeneous annotations into twelve common physiologically QA tasks... all PPG recordings are standardized through a unified preprocessing pipeline comprising four stages: Resampling... Filtering... Segmentation... Normalization.
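The four preprocessing stages named in the quoted passage (resampling, filtering, segmentation, normalization) can be sketched in plain Python as follows. Only the 10-second segment length comes from the paper's abstract; the 50 Hz target rate, the linear-interpolation resampler, the moving-average baseline removal (a simple stand-in for whatever filter the authors use), the z-score normalization, and all function names are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def resample_linear(signal: List[float], fs_in: float, fs_out: float) -> List[float]:
    """Resample by linear interpolation (no anti-aliasing; illustration only)."""
    n_out = int(len(signal) * fs_out / fs_in)
    last = len(signal) - 1
    out = []
    for i in range(n_out):
        t = i * fs_in / fs_out          # position in input-sample units
        lo = min(int(t), last)
        hi = min(lo + 1, last)
        frac = t - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out


def remove_baseline(signal: List[float], fs: float, window_s: float = 1.0) -> List[float]:
    """Subtract a centered moving average to suppress slow baseline drift."""
    half = max(1, int(window_s * fs / 2))
    return [
        x - mean(signal[max(0, i - half): i + half + 1])
        for i, x in enumerate(signal)
    ]


def segment(signal: List[float], fs: float, seconds: float = 10.0) -> List[List[float]]:
    """Cut non-overlapping fixed-length windows; a trailing remainder is dropped."""
    size = int(seconds * fs)
    return [signal[s: s + size] for s in range(0, len(signal) - size + 1, size)]


def zscore(seg: List[float]) -> List[float]:
    """Normalize one segment to zero mean and unit variance."""
    mu, sd = mean(seg), pstdev(seg)
    if sd == 0:
        return [0.0] * len(seg)  # flat segment: avoid division by zero
    return [(x - mu) / sd for x in seg]


def preprocess(signal: List[float], fs_in: float, fs_out: float = 50.0) -> List[List[float]]:
    """Run the four stages in order and return normalized 10-second segments."""
    x = resample_linear(signal, fs_in, fs_out)
    x = remove_baseline(x, fs_out)
    return [zscore(seg) for seg in segment(x, fs_out)]
```

Normalizing after segmentation means each 10-second window is scaled independently, which removes amplitude differences between devices but also any absolute-amplitude information; whether the paper makes the same choice is not stated in the passage.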
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.