Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots
Pith reviewed 2026-06-27 09:41 UTC · model grok-4.3
The pith
A spatial-temporal transformer estimates heart rate from video with a best-run mean absolute error of 0.79 bpm under varying illumination, using 3D face alignment, illumination augmentation, and a hybrid waveform-spectral loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that its spatial-temporal transformer, combining PRNet-based 3D face alignment, clip-level illumination augmentation, the Residual Temporal Standardization Module, and hybrid supervision with β = 5, reaches a best-run heart-rate mean absolute error of 0.79 bpm and a correlation of 0.982 on the authors' static all-level mix protocol, which covers three illumination levels. Relative to the PhysFormer baseline evaluated on the same data, this is a 93.6 % reduction in error and an increase in correlation from 0.088 to 0.982.
What carries the argument
An end-to-end spatial-temporal transformer framework that applies PRNet-based 3D face alignment, clip-level illumination augmentation, the Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision, in which the weight β balances the waveform and spectral losses.
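A minimal sketch of how such a β-weighted objective could be assembled. It is not the paper's implementation: the waveform term is a plain negative Pearson loss standing in for the Soft-Shifted Pearson loss, whose definition is not given in this summary, and the spectral term is assumed to be a KL divergence between a Gaussian target centred on the reference heart rate and the normalised power spectrum of the prediction. All function names, the 40 to 180 bpm grid, and `sigma` are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def band_power(signal, fps, bpm_grid):
    """Power of the mean-removed signal at each candidate heart rate (direct DFT)."""
    n = len(signal)
    mean = sum(signal) / n
    powers = []
    for bpm in bpm_grid:
        w = 2 * math.pi * (bpm / 60.0) / fps  # radians per frame
        re = sum((s - mean) * math.cos(w * t) for t, s in enumerate(signal))
        im = sum((s - mean) * math.sin(w * t) for t, s in enumerate(signal))
        powers.append((re * re + im * im) / n)
    return powers

def normalise(values):
    total = sum(values)
    return [v / total for v in values]

def spectral_kl(pred, fps, true_bpm, bpm_grid, sigma=1.0):
    """KL(target || predicted spectrum); the Gaussian target is an assumption."""
    eps = 1e-12  # keeps both distributions strictly positive
    p_pred = normalise([p + eps for p in band_power(pred, fps, bpm_grid)])
    target = normalise(
        [math.exp(-0.5 * ((b - true_bpm) / sigma) ** 2) + eps for b in bpm_grid]
    )
    return sum(t * math.log(t / q) for t, q in zip(target, p_pred))

def hybrid_loss(pred, ref, fps, true_bpm, beta=5.0, bpm_grid=range(40, 181)):
    """Waveform loss plus beta times the spectral loss."""
    waveform = 1.0 - pearson(pred, ref)
    spectral = spectral_kl(pred, fps, true_bpm, list(bpm_grid))
    return waveform + beta * spectral
```

With `beta=0` the objective reduces to the waveform term alone, which is one way to see what the frequency-domain guidance adds in an ablation over β.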
If this is right
- Heart-rate estimation from robot-mounted cameras becomes usable across three distinct illumination levels when the hybrid loss is weighted at β=5.
- The method reduces mean absolute error by 93.6 % and raises correlation from 0.088 to 0.982 relative to the PhysFormer baseline on the tested dataset.
- Performance is strongest among the β values examined when frequency-domain guidance receives five times the weight of the waveform loss.
- The combination of 3D face alignment and clip-level illumination augmentation supports the reported robustness on the static protocol.
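The baseline error itself is not quoted in this summary, but it can be back-computed from the two figures that are. Assuming the 93.6 % reduction is measured relative to the baseline MAE, and allowing for rounding in both reported numbers:

```latex
\[
\mathrm{MAE}_{\text{baseline}}
  \approx \frac{\mathrm{MAE}_{\text{proposed}}}{1 - 0.936}
  = \frac{0.79}{0.064}
  \approx 12.3\ \text{bpm}
\]
```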
Where Pith is reading between the lines
- If the static-protocol results hold for moving robots, the same pipeline could support continuous physiological awareness during human-robot interaction in homes or care settings.
- The hybrid loss structure might be adapted to estimate additional signals such as breathing rate by swapping the target frequency band.
- Deployment on robots would require additional checks for motion blur and subject movement not present in the static test protocol.
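The band-swap idea in the second point can be illustrated with a small sketch. It assumes the rate is read off as the strongest spectral peak inside a chosen band, which is a common post-processing step but not something this summary attributes to the paper. The function name and the band limits (40 to 180 per minute for heart rate, 6 to 30 per minute for breathing) are illustrative assumptions.

```python
import math

def dominant_rate_per_min(signal, fps, low_per_min, high_per_min):
    """Return the integer rate (events per minute) with the most spectral power.

    Candidate rates are scanned in steps of 1 per minute over the
    inclusive band [low_per_min, high_per_min].
    """
    n = len(signal)
    mean = sum(signal) / n
    best_rate, best_power = None, -1.0
    for rate in range(low_per_min, high_per_min + 1):
        w = 2 * math.pi * (rate / 60.0) / fps  # radians per frame
        re = sum((s - mean) * math.cos(w * t) for t, s in enumerate(signal))
        im = sum((s - mean) * math.sin(w * t) for t, s in enumerate(signal))
        power = re * re + im * im
        if power > best_power:
            best_rate, best_power = rate, power
    return best_rate
```

Calling the same function with a different band is all that changes between the two targets; whether a network trained with heart-rate supervision would produce a waveform carrying usable respiratory content is a separate question.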
Load-bearing premise
The assumption that the listed components together will deliver the reported accuracy under real robot deployment conditions with moving subjects and naturally changing light rather than only on the authors' static all-level mix protocol.
What would settle it
Running the trained estimator on video recorded by a moving robot camera in everyday indoor lighting with non-static human subjects and measuring whether the mean absolute error stays below 2 bpm and the correlation stays above 0.9.
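The check described above can be written down as a short sketch. The thresholds (MAE below 2 bpm, correlation above 0.9) come from the text; the function names and the per-recording list format are assumptions for illustration.

```python
import math

def mean_absolute_error(pred_bpm, ref_bpm):
    """Mean absolute difference between predicted and reference heart rates."""
    return sum(abs(p - r) for p, r in zip(pred_bpm, ref_bpm)) / len(ref_bpm)

def pearson(x, y):
    """Pearson correlation (0.0 if either sequence is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def passes_settling_test(pred_bpm, ref_bpm, max_mae=2.0, min_corr=0.9):
    """True only if both the error and the correlation criteria are met."""
    return (
        mean_absolute_error(pred_bpm, ref_bpm) < max_mae
        and pearson(pred_bpm, ref_bpm) > min_corr
    )
```

Both criteria matter: an estimator that always outputs the population-average heart rate can have a modest MAE while its correlation with the reference stays near zero.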
Original abstract
Physiological awareness is important for service, social, and assistive robots that interact with humans in everyday environments. Remote photoplethysmography (rPPG) enables non-contact heart-rate (HR) estimation from an RGB camera, making it a promising sensing modality for robot-mounted vision systems. However, illumination variation remains a major barrier to robust deployment. This paper presents an end-to-end spatial-temporal transformer framework for remote HR estimation on a new dataset with varied illumination. Our estimator integrates PRNet-based 3D face alignment, clip-level illumination augmentation, the Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision. The training objective combines a Soft-Shifted Pearson waveform loss with a spectral Kullback-Leibler divergence loss, where a tuned weight ($\mathbf{\beta}$) controls the contribution of frequency-domain heart-rate guidance. Experiments on a static all-level mix protocol covering three illumination levels show that $\mathbf{\beta}=5$ provides the strongest result among the tested beta settings, achieving a best-run HR mean absolute error (MAE) of 0.79 bpm and an HR correlation of 0.982. Compared with the PhysFormer baseline evaluated on our dataset, our estimator reduces HR MAE by 93.6 %, while increasing HR correlation from 0.088 to 0.982, making it usable when illumination varies.
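The abstract does not say how the clip-level illumination augmentation is implemented. One plausible reading, sketched here as an assumption, is that a single random gain and gamma are drawn per clip and applied to every frame, so that the subtle frame-to-frame intensity changes carrying the pulse are not disturbed by per-frame jitter. Frames are flat lists of intensities in [0, 1] for brevity; the function name and parameter ranges are hypothetical.

```python
import random

def augment_clip(clip, rng, gain_range=(0.5, 1.5), gamma_range=(0.7, 1.5)):
    """Apply one randomly drawn gain and gamma to all frames of a clip.

    clip: list of frames, each a flat list of intensities in [0, 1].
    rng:  a random.Random instance, passed in for reproducibility.
    Returns the augmented clip together with the sampled gain and gamma.
    """
    gain = rng.uniform(*gain_range)    # sampled once per clip
    gamma = rng.uniform(*gamma_range)  # sampled once per clip

    def transform(value):
        # Gamma curve, then gain, then clamp back into the valid range.
        return min(1.0, max(0.0, gain * (value ** gamma)))

    augmented = [[transform(v) for v in frame] for frame in clip]
    return augmented, gain, gamma
```

Because the transform is identical for every frame, two identical input frames stay identical after augmentation, which is the property that distinguishes clip-level from frame-level augmentation.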
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to present an end-to-end spatial-temporal transformer framework for remote photoplethysmography (rPPG) heart-rate estimation that is robust to illumination variation for use in robot physiological sensing. The method integrates PRNet-based 3D face alignment, clip-level illumination augmentation, a Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision via a Soft-Shifted Pearson waveform loss combined with spectral Kullback-Leibler divergence, with a tuned weight β controlling the frequency term. On a new dataset under a static all-level mix protocol covering three illumination levels, β=5 yields a best-run MAE of 0.79 bpm and correlation of 0.982, reported as a 93.6% MAE reduction and correlation increase from 0.088 to 0.982 relative to the PhysFormer baseline evaluated on the same data, supporting the claim of usability when illumination varies.
Significance. If the reported gains are shown to hold under robot-mounted camera conditions, the work would provide a useful advance in illumination-robust non-contact HR sensing for service, social, and assistive robots by addressing a practical deployment barrier in dynamic environments.
major comments (2)
- [Abstract] β is explicitly chosen as the value (β=5) that produces the strongest result among tested settings on the same static all-level mix evaluation protocol used to report the final MAE of 0.79 bpm and correlation of 0.982; this selection makes the contribution of the frequency-domain term data-dependent rather than independently derived.
- [Abstract] The central claim is that the estimator is usable for robot physiological sensing under varying illumination, yet all quantitative results derive exclusively from a static all-level mix protocol; no results are provided under camera ego-motion, subject head translation/rotation, or changing subject-camera distance that would occur with a moving robot platform.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract and evaluation protocol. We address each major comment below and will incorporate revisions to improve clarity and precision.
- Referee: [Abstract] β is explicitly chosen as the value (β=5) that produces the strongest result among tested settings on the same static all-level mix evaluation protocol used to report the final MAE of 0.79 bpm and correlation of 0.982; this selection makes the contribution of the frequency-domain term data-dependent rather than independently derived.
  Authors: We acknowledge the validity of this observation. The value β=5 was selected because it produced the strongest result among the tested settings on the reported protocol. In the revised manuscript, we will expand the abstract to report performance for the full range of tested β values and explicitly note that β=5 corresponds to the best configuration observed on this dataset, thereby making the hyperparameter selection process transparent.
  revision: yes
- Referee: [Abstract] The central claim is that the estimator is usable for robot physiological sensing under varying illumination, yet all quantitative results derive exclusively from a static all-level mix protocol; no results are provided under camera ego-motion, subject head translation/rotation, or changing subject-camera distance that would occur with a moving robot platform.
  Authors: We agree that the reported results are confined to a static all-level mix protocol and do not include camera ego-motion or dynamic subject-camera geometry. The present work isolates illumination variation as the primary variable. We will revise the abstract and add a limitations paragraph to state that the method demonstrates illumination robustness under static conditions and to identify evaluation under robot-mounted dynamic conditions as an important direction for future work.
  revision: yes
Circularity Check
β hyperparameter selected by performance on evaluation protocol
specific steps
- fitted input called prediction [Abstract]
  "Experiments on a static all-level mix protocol covering three illumination levels show that β=5 provides the strongest result among the tested beta settings, achieving a best-run HR mean absolute error (MAE) of 0.79 bpm and an HR correlation of 0.982."
  The weight β is chosen as the value among tested settings that yields the strongest result on the reported evaluation protocol; the quoted performance numbers are therefore obtained by selecting the hyperparameter that optimizes the reported metrics rather than being an independent outcome of the method.
full rationale
The paper reports performance numbers obtained after selecting β=5 as the value that yields the strongest result on the static all-level mix protocol used for evaluation. This constitutes a fitted_input_called_prediction pattern because the reported MAE and correlation are the outcome of choosing the hyperparameter that optimizes those exact metrics. No other circularity patterns (self-definitional, self-citation load-bearing, etc.) are present in the provided text; the method derivation itself does not reduce to its inputs by construction.
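A standard way to avoid this pattern is to choose β on a validation split and touch the test split only once, for the chosen value. The sketch below illustrates that procedure; it is not something the paper or the rebuttal describes. `train_and_score` is a hypothetical callable that trains with a given β and returns the MAE on the named split.

```python
def select_beta(candidates, train_and_score, val_split, test_split):
    """Pick beta on the validation split, then report the test MAE once.

    candidates:       iterable of beta values to compare.
    train_and_score:  hypothetical callable (beta, split) -> MAE in bpm.
    Returns (best_beta, validation MAE per beta, test MAE for best_beta).
    """
    # Model selection sees validation data only.
    val_mae = {beta: train_and_score(beta, val_split) for beta in candidates}
    best_beta = min(val_mae, key=val_mae.get)
    # The test split is evaluated a single time, after beta is fixed.
    test_mae = train_and_score(best_beta, test_split)
    return best_beta, val_mae, test_mae
```

Reporting the full `val_mae` table alongside the single test figure would also cover the authors' promise to show performance for every tested β.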
Axiom & Free-Parameter Ledger
free parameters (1)
- β = 5
axioms (1)
- domain assumption: PRNet-based 3D face alignment remains accurate under the three illumination levels used in the static all-level mix protocol
invented entities (1)
- Residual Temporal Standardization Module: no independent evidence