arxiv: 2606.12378 · v1 · pith:JU7BK5C7new · submitted 2026-06-10 · 💻 cs.CV · cs.AI

Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots

Pith reviewed 2026-06-27 09:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords remote photoplethysmography · heart rate estimation · illumination robustness · transformer · rPPG · physiological sensing · robot vision

The pith

A spatial-temporal transformer estimates heart rate from video with a best-run mean absolute error of 0.79 bpm under varying illumination, using 3D face alignment, clip-level illumination augmentation, and a hybrid waveform-spectral loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an end-to-end framework for remote photoplethysmography that keeps heart-rate estimation accurate when lighting changes, a key requirement for cameras mounted on service or assistive robots. It integrates PRNet-based 3D face alignment to handle pose, clip-level illumination augmentation during training, a Residual Temporal Standardization Module, and a hybrid loss that balances a Soft-Shifted Pearson waveform term against a spectral Kullback-Leibler term weighted by a tunable β. On a dataset covering three illumination levels, evaluated under a static all-level mix protocol, β=5 gives the strongest result among the tested settings: a best-run mean absolute error of 0.79 bpm and a correlation of 0.982. The paper reports this as a 93.6 % lower error than the PhysFormer baseline on the same data, whose correlation is 0.088.

Core claim

The paper claims that the described spatial-temporal transformer, combining PRNet-based 3D face alignment, clip-level illumination augmentation, the Residual Temporal Standardization Module, and hybrid supervision with β set to 5, produces heart-rate estimates with a best-run mean absolute error of 0.79 bpm and a correlation of 0.982 on the authors' static all-level mix protocol covering three illumination levels. Relative to the PhysFormer baseline evaluated on the same data, this corresponds to a 93.6 % reduction in error and an increase in correlation from 0.088 to 0.982.
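As a quick sanity check on the quoted figures, the 93.6 % reduction and the 0.79 bpm error together imply a baseline error of roughly 12 bpm. The sketch below assumes the reduction is defined as 1 − MAE_ours / MAE_baseline and that both numbers refer to the same run; the baseline MAE itself is not quoted on this page, so the result is only a back-calculation. The function name is hypothetical.

```python
def implied_baseline_mae(mae_ours: float, reduction_pct: float) -> float:
    """Back out a baseline MAE from a relative reduction.

    Assumes reduction_pct = 100 * (1 - mae_ours / mae_baseline).
    """
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction_pct must be in [0, 100)")
    return mae_ours / (1.0 - reduction_pct / 100.0)


# Figures quoted in the core claim above.
baseline = implied_baseline_mae(mae_ours=0.79, reduction_pct=93.6)
print(f"implied PhysFormer MAE on this dataset: about {baseline:.1f} bpm")
```

Because both inputs are rounded, the implied value should be read as approximate.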

What carries the argument

The end-to-end spatial-temporal transformer framework, which applies PRNet-based 3D face alignment, clip-level illumination augmentation, the Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision in which the weight β balances the waveform and spectral losses.
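To make the role of β concrete, here is a minimal NumPy sketch of a hybrid objective of the form waveform loss + β · spectral loss. It is not the authors' implementation. The paper's exact Soft-Shifted Pearson definition is not given on this page, so the waveform term below is a stand-in (a softmax-weighted Pearson correlation over a few integer lags), and the spectral term compares band-limited power spectra of the predicted and reference waveforms. The function names, the 0.7 to 3.0 Hz band, `max_shift`, and `temperature` are all assumptions.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation of two equal-length 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def soft_shifted_pearson_loss(pred, target, max_shift=5, temperature=0.1):
    """Stand-in waveform loss: 1 - softmax-weighted Pearson over integer lags."""
    n = len(pred)
    corrs = []
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            corrs.append(pearson(pred[s:], target[: n - s]))
        else:
            corrs.append(pearson(pred[: n + s], target[-s:]))
    corrs = np.array(corrs)
    weights = np.exp((corrs - corrs.max()) / temperature)
    weights /= weights.sum()
    return 1.0 - float(weights @ corrs)


def band_spectrum(x, fs, lo=0.7, hi=3.0):
    """Normalised power spectrum restricted to an assumed heart-rate band (Hz)."""
    x = x - x.mean()
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    mask = (freqs >= lo) & (freqs <= hi)
    p = power[mask] + 1e-12  # avoid log(0)
    return p / p.sum()


def spectral_kl_loss(pred, target, fs):
    """KL(target spectrum || predicted spectrum) inside the band."""
    p = band_spectrum(target, fs)
    q = band_spectrum(pred, fs)
    return float(np.sum(p * np.log(p / q)))


def hybrid_loss(pred, target, fs, beta=5.0):
    """Waveform term plus beta times the spectral term."""
    return soft_shifted_pearson_loss(pred, target) + beta * spectral_kl_loss(pred, target, fs)


fs = 30.0
t = np.arange(300) / fs
target = np.sin(2 * np.pi * 1.2 * t)              # 72 bpm reference
same_rate = np.sin(2 * np.pi * 1.2 * t + 0.3)     # right rate, small phase lag
wrong_rate = np.sin(2 * np.pi * 2.0 * t)          # 120 bpm
loss_same = hybrid_loss(same_rate, target, fs)
loss_wrong = hybrid_loss(wrong_rate, target, fs)
print(loss_same, loss_wrong)
```

In this toy setup a prediction at the wrong rate is penalised mainly through the spectral term, and a larger β increases that penalty relative to the waveform term.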

If this is right

  • Heart-rate estimation from robot-mounted cameras becomes usable across three distinct illumination levels when the hybrid loss is weighted at β=5.
  • The method reduces mean absolute error by 93.6 % and raises correlation from 0.088 to 0.982 relative to the PhysFormer baseline on the tested dataset.
  • Among the β values examined, performance is strongest at β=5; if the waveform term has unit weight, frequency-domain guidance then receives five times its weight.
  • The combination of 3D face alignment and clip-level illumination augmentation supports the reported robustness on the static protocol.
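Clip-level augmentation, as named in the list above, can be illustrated by drawing one brightness and contrast setting per clip and applying it to every frame. The sketch below is an assumption about what such an augmentation could look like, not the authors' code; the function name and the gain and bias ranges are hypothetical.

```python
import numpy as np


def augment_clip_illumination(clip, rng, gain_range=(0.6, 1.4), bias_range=(-0.1, 0.1)):
    """Apply one random contrast gain and brightness bias to a whole clip.

    clip: float array in [0, 1] with shape (T, H, W, C).
    The same (gain, bias) pair is shared by all T frames, so frame-to-frame
    differences are rescaled by a constant rather than randomised.
    The ranges are illustrative assumptions, not values from the paper.
    """
    gain = rng.uniform(*gain_range)
    bias = rng.uniform(*bias_range)
    mean = clip.mean()
    out = (clip - mean) * gain + mean + bias
    # Clipping keeps pixel values valid; it breaks exact linearity only
    # for pixels that saturate.
    return np.clip(out, 0.0, 1.0), gain, bias


rng = np.random.default_rng(0)
clip = 0.5 + 0.1 * (rng.random((8, 4, 4, 3)) - 0.5)  # synthetic mid-grey clip
out, gain, bias = augment_clip_illumination(clip, rng)

# Per-frame mean intensity before and after augmentation.
trace_in = clip.mean(axis=(1, 2, 3))
trace_out = out.mean(axis=(1, 2, 3))
```

When no pixel saturates, the per-frame intensity trace is an affine function of the original, so its temporal shape is unchanged up to scale and offset. Per-frame random settings would not have this property.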

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the static-protocol results hold for moving robots, the same pipeline could support continuous physiological awareness during human-robot interaction in homes or care settings.
  • The hybrid loss structure might be adapted to estimate additional signals such as breathing rate by swapping the target frequency band.
  • Deployment on robots would require additional checks for motion blur and subject movement not present in the static test protocol.
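The idea of swapping the target frequency band can be sketched with a simple spectral peak picker in which the band is a parameter. This is an illustration of the general idea only, not the paper's loss or estimator; the function name and both band limits are assumptions.

```python
import numpy as np


def dominant_rate_per_minute(signal, fs, band_hz):
    """Return the strongest spectral peak inside band_hz, in cycles per minute."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    lo, hi = band_hz
    mask = (freqs >= lo) & (freqs <= hi)
    if not mask.any():
        raise ValueError("band contains no frequency bins; use a longer clip")
    return 60.0 * float(freqs[mask][np.argmax(power[mask])])


HR_BAND = (0.7, 3.0)      # assumed: 42 to 180 beats per minute
BREATH_BAND = (0.1, 0.5)  # assumed: 6 to 30 breaths per minute

fs = 30.0
t = np.arange(int(60 * fs)) / fs  # 60 s synthetic trace
trace = np.sin(2 * np.pi * 1.2 * t) + 0.5 * np.sin(2 * np.pi * 0.25 * t)

hr = dominant_rate_per_minute(trace, fs, HR_BAND)
br = dominant_rate_per_minute(trace, fs, BREATH_BAND)
print(hr, br)
```

Lower-frequency bands need longer clips to resolve: frequency resolution is the reciprocal of the clip duration, so a 10 s clip separates breathing rates only in steps of 6 breaths per minute.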

Load-bearing premise

The assumption that the listed components together will deliver the reported accuracy under real robot deployment conditions with moving subjects and naturally changing light rather than only on the authors' static all-level mix protocol.

What would settle it

Running the trained estimator on video recorded by a moving robot camera in everyday indoor lighting with non-static human subjects and measuring whether the mean absolute error stays below 2 bpm and the correlation stays above 0.9.
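The pass criterion described above can be written as a small check. This is a sketch under the thresholds named in the text (MAE below 2 bpm, correlation above 0.9); the function names are hypothetical and the sample values are placeholders, not measurements.

```python
import numpy as np


def hr_metrics(pred_bpm, true_bpm):
    """Return (mean absolute error in bpm, Pearson correlation)."""
    pred = np.asarray(pred_bpm, dtype=float)
    true = np.asarray(true_bpm, dtype=float)
    mae = float(np.mean(np.abs(pred - true)))
    r = float(np.corrcoef(pred, true)[0, 1])
    return mae, r


def passes_deployment_check(pred_bpm, true_bpm, max_mae=2.0, min_r=0.9):
    """True if MAE stays below max_mae and correlation stays above min_r."""
    mae, r = hr_metrics(pred_bpm, true_bpm)
    return mae < max_mae and r > min_r


# Placeholder values for illustration only.
true_bpm = [62, 75, 88, 96, 110]
pred_bpm = [63, 74, 89, 95, 112]
mae, r = hr_metrics(pred_bpm, true_bpm)
ok = passes_deployment_check(pred_bpm, true_bpm)
```

Correlation is only informative when the reference heart rates span a reasonable range, so both metrics are checked together.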

Figures

Figures reproduced from arXiv: 2606.12378 by Torbjörn E. M. Nordling, Zhi Wei Xu.

Figure 1: Experimental setup for the non-contact physiological signal measurement protocol. Participants sit on the bike facing the camera mounted above the TV under controlled illumination. Reproduced from Wang (2020) [12]. 3.2 Preprocessing. Raw videos are first processed by a PRNet-based 3D face alignment pipeline. PRNet predicts dense facial geometry and provides semantically aligned facial regions through UV p…
Figure 2: Overall architecture of our estimator. The input is a PRNet-preprocessed facial video clip. RTSM is the key module in this work; it is inserted after the convolutional stem and before tube-token embedding to reduce brightness-induced temporal feature-statistic shifts. The PhysFormer-style temporal-difference Transformer block is repeated N times; in our implementation, N = 12 and the blocks are grouped i…
Figure 3: Visualization of the Residual Temporal Standardization Module. (a) It operates on the stem feature tensor X with shape [B, 96, 500, 16, 16]. (b) One temporal feature sequence x = X[b, c, :, h, w] is selected for visualization. (c) The temporal mean μ_T and standard deviation σ_T are computed over T = 500 frames for each channel and spatial location rather than globally over all channels or spatial positions…
Figure 4: Learned RTSM residual coefficient α. 5 Discussion. The static all-level mix results show that illumination robustness depends on both data-side and objective-side design. PRNet preprocessing provides stable aligned facial inputs, which helps reduce spatial inconsistency before learning. Clip-level illumination augmentation expands the apparent brightness and contrast distribution while preserving temporal …
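The figure captions describe the Residual Temporal Standardization Module as standardising each temporal feature sequence with its own mean and standard deviation over T frames, with a learned residual coefficient α. A minimal NumPy sketch follows. It assumes, hypothetically, that α blends the standardised features with the input as x + α · (x̂ − x); the module's exact residual form is not visible in the truncated captions, and the small tensor shape here is for illustration only.

```python
import numpy as np


def rtsm(x, alpha, eps=1e-5):
    """Sketch of temporal standardisation with a residual blend.

    x: array of shape (B, C, T, H, W).
    Statistics are taken along T separately for every (b, c, h, w),
    not pooled over channels or spatial positions.
    The blend x + alpha * (x_std - x) is an assumption.
    """
    mu_t = x.mean(axis=2, keepdims=True)
    sigma_t = x.std(axis=2, keepdims=True)
    x_std = (x - mu_t) / (sigma_t + eps)
    return x + alpha * (x_std - x)


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 50, 3, 3))  # the captions quote [B, 96, 500, 16, 16]
y = rtsm(x, alpha=1.0)

# A global gain and offset, a crude stand-in for a brightness change,
# leaves the fully standardised output almost unchanged.
y_bright = rtsm(1.5 * x + 0.3, alpha=1.0)
```

With α = 0 the input passes through untouched and with α = 1 it is fully standardised, so under this assumed form a learned α between the two controls how much of the original feature statistics is kept.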
Abstract

Physiological awareness is important for service, social, and assistive robots that interact with humans in everyday environments. Remote photoplethysmography (rPPG) enables non-contact heart-rate (HR) estimation from an RGB camera, making it a promising sensing modality for robot-mounted vision systems. However, illumination variation remains a major barrier to robust deployment. This paper presents an end-to-end spatial-temporal transformer framework for remote HR estimation on a new dataset with varied illumination. Our estimator integrates PRNet-based 3D face alignment, clip-level illumination augmentation, the Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision. The training objective combines a Soft-Shifted Pearson waveform loss with a spectral Kullback-Leibler divergence loss, where a tuned weight ($\mathbf{\beta}$) controls the contribution of frequency-domain heart-rate guidance. Experiments on a static all-level mix protocol covering three illumination levels show that $\mathbf{\beta}=5$ provides the strongest result among the tested beta settings, achieving a best-run HR mean absolute error (MAE) of 0.79 bpm and an HR correlation of 0.982. Compared with the PhysFormer baseline evaluated on our dataset, our estimator reduces HR MAE by 93.6 %, while increasing HR correlation from 0.088 to 0.982, making it usable when illumination varies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents an end-to-end spatial-temporal transformer framework for remote photoplethysmography (rPPG) heart-rate estimation that is intended to be robust to illumination variation for use in robot physiological sensing. The method integrates PRNet-based 3D face alignment, clip-level illumination augmentation, a Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision via a Soft-Shifted Pearson waveform loss combined with a spectral Kullback-Leibler divergence loss, with a tuned weight β controlling the frequency term. On a new dataset under a static all-level mix protocol covering three illumination levels, β=5 yields a best-run MAE of 0.79 bpm and a correlation of 0.982, reported as a 93.6 % MAE reduction and a correlation increase from 0.088 to 0.982 relative to the PhysFormer baseline evaluated on the same data. The authors take this as support for the claim of usability when illumination varies.

Significance. If the reported gains are shown to hold under robot-mounted camera conditions, the work would provide a useful advance in illumination-robust non-contact HR sensing for service, social, and assistive robots by addressing a practical deployment barrier in dynamic environments.

major comments (2)
  1. [Abstract] β is explicitly chosen as the value (β=5) that produces the strongest result among tested settings on the same static all-level mix evaluation protocol used to report the final MAE of 0.79 bpm and correlation of 0.982; this selection makes the contribution of the frequency-domain term data-dependent rather than independently derived.
  2. [Abstract] The central claim is that the estimator is usable for robot physiological sensing under varying illumination, yet all quantitative results derive exclusively from a static all-level mix protocol; no results are provided under camera ego-motion, subject head translation/rotation, or changing subject-camera distance that would occur with a moving robot platform.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and evaluation protocol. We address each major comment below and will incorporate revisions to improve clarity and precision.

Point-by-point responses
  1. Referee: [Abstract] β is explicitly chosen as the value (β=5) that produces the strongest result among tested settings on the same static all-level mix evaluation protocol used to report the final MAE of 0.79 bpm and correlation of 0.982; this selection makes the contribution of the frequency-domain term data-dependent rather than independently derived.

    Authors: We acknowledge the validity of this observation. The value β=5 was selected because it produced the strongest result among the tested settings on the reported protocol. In the revised manuscript, we will expand the abstract to report performance for the full range of tested β values and explicitly note that β=5 corresponds to the best configuration observed on this dataset, thereby making the hyperparameter selection process transparent. revision: yes

  2. Referee: [Abstract] The central claim is that the estimator is usable for robot physiological sensing under varying illumination, yet all quantitative results derive exclusively from a static all-level mix protocol; no results are provided under camera ego-motion, subject head translation/rotation, or changing subject-camera distance that would occur with a moving robot platform.

    Authors: We agree that the reported results are confined to a static all-level mix protocol and do not include camera ego-motion or dynamic subject-camera geometry. The present work isolates illumination variation as the primary variable. We will revise the abstract and add a limitations paragraph to state that the method demonstrates illumination robustness under static conditions and to identify evaluation under robot-mounted dynamic conditions as an important direction for future work. revision: yes

Circularity Check

1 step flagged

β hyperparameter selected by performance on evaluation protocol

specific steps
  1. fitted input called prediction [Abstract]
    "Experiments on a static all-level mix protocol covering three illumination levels show that β=5 provides the strongest result among the tested beta settings, achieving a best-run HR mean absolute error (MAE) of 0.79 bpm and an HR correlation of 0.982."

    The weight β is chosen as the value among tested settings that yields the strongest result on the reported evaluation protocol; the quoted performance numbers are therefore obtained by selecting the hyperparameter that optimizes the reported metrics rather than being an independent outcome of the method.

full rationale

The paper reports performance numbers obtained after selecting β=5 as the value that yields the strongest result on the static all-level mix protocol used for evaluation. This constitutes a fitted_input_called_prediction pattern because the reported MAE and correlation are the outcome of choosing the hyperparameter that optimizes those exact metrics. No other circularity patterns (self-definitional, self-citation load-bearing, etc.) are present in the provided text; the method derivation itself does not reduce to its inputs by construction.
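One common way to avoid this pattern is to choose β on a held-out validation split and read the test metric once for the chosen value. The sketch below illustrates that procedure in plain Python. The function name is hypothetical, and the MAE tables are made-up toy numbers used only to show the mechanics; they are not results from the paper.

```python
from typing import Callable, Dict, Sequence, Tuple


def select_beta_on_validation(
    betas: Sequence[float],
    val_mae: Callable[[float], float],
    test_mae: Callable[[float], float],
) -> Tuple[float, float, Dict[float, float]]:
    """Pick beta by validation MAE only, then evaluate the test MAE once."""
    val_scores = {b: val_mae(b) for b in betas}
    best = min(val_scores, key=val_scores.get)
    return best, test_mae(best), val_scores


# Toy lookup tables standing in for "train with beta, then evaluate".
toy_val = {1: 3.0, 2: 2.1, 5: 2.4, 10: 4.0}
toy_test = {1: 3.2, 2: 2.5, 5: 2.2, 10: 4.1}

best_beta, reported_mae, _ = select_beta_on_validation(
    betas=list(toy_val), val_mae=toy_val.get, test_mae=toy_test.get
)
optimistic_mae = min(toy_test.values())  # what selecting on the test set would report
```

In the toy tables, selecting on the test split would report 2.2 while the validation-selected configuration reports 2.5; the gap is the optimism that the circularity check is pointing at.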

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the listed preprocessing and loss components plus the tuned β on the authors' dataset; no independent evidence is supplied for generalization beyond that dataset.

free parameters (1)
  • beta = 5
    Tuned scalar that balances the Soft-Shifted Pearson waveform loss against the spectral Kullback-Leibler divergence loss; selected as the value giving the strongest result among tested settings.
axioms (1)
  • domain assumption: PRNet-based 3D face alignment remains accurate under the three illumination levels used in the static all-level mix protocol
    The framework description states that PRNet is integrated for face alignment as the first processing step.
invented entities (1)
  • Residual Temporal Standardization Module (no independent evidence)
    purpose: Standardizes temporal features to improve robustness to illumination variation
    Introduced as a core component of the proposed estimator; no external validation of its necessity is provided.

pith-pipeline@v0.9.1-grok · 5778 in / 1653 out tokens · 24754 ms · 2026-06-27T09:41:22.498408+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 14 canonical work pages

  1. [1]

    Survey on physiological computing in human-robot collaboration

    Celal Savur and Ferat Sahin. Survey on physiological computing in human-robot collaboration. Machines, 11(5):536, 2023. doi: 10.3390/machines11050536

  2. [2]

    Remote plethysmographic imaging using ambient light

    Wim Verkruysse, Lars O Svaasand, and J Stuart Nelson. Remote plethysmographic imaging using ambient light. Optics Express, 16(26):21434–21445, 2008. doi: 10.1364/OE.16.021434

  3. [3]

    Non-contact video-based pulse rate measurement on a mobile service robot

    Ronny Stricker, Steffen Mueller, and Horst-Michael Gross. Non-contact video-based pulse rate measurement on a mobile service robot. In Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication, pages 1056–1062, 2014. doi: 10.1109/ROMAN.2014.6926392

  4. [4]

    AutoHR: A strong end-to-end baseline for remote heart rate measurement with neural searching

    Zitong Yu, Xiaobai Li, Xuesong Niu, Jingang Shi, and Guoying Zhao. AutoHR: A strong end-to-end baseline for remote heart rate measurement with neural searching. IEEE Signal Processing Letters, 27:1245–1249, 2020. doi: 10.1109/LSP.2020.3007086

  5. [5]

    PhysFormer: Facial video-based physiological measurement with temporal difference transformer

    Zitong Yu, Yuming Shen, Jingang Shi, Hengshuang Zhao, Philip HS Torr, and Guoying Zhao. PhysFormer: Facial video-based physiological measurement with temporal difference transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4186–4196, 2022. doi: 10.1109/CVPR52688.2022.00415

  6. [6]

    LSTC-rPPG: Long short-term convolutional network for remote photoplethysmography

    Jun Seong Lee, Gyutae Hwang, Moonwook Ryu, and Sang Jun Lee. LSTC-rPPG: Long short-term convolutional network for remote photoplethysmography. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. doi: 10.1109/CVPRW59228.2023.00640

  7. [7]

    Robust and generalizable heart rate estimation via deep learning for remote photoplethysmography in complex scenarios

    Kang Cen, Chang-Hong Fu, and Hong Hong. Robust and generalizable heart rate estimation via deep learning for remote photoplethysmography in complex scenarios. arXiv preprint, 2025. doi: 10.48550/arXiv.2507.07795

  8. [8]

    FreqPhys: Repurposing implicit physiological frequency prior for robust remote photoplethysmography

    Wei Qian, Dan Guo, Jinxing Zhou, Bochao Zou, Zitong Yu, and Meng Wang. FreqPhys: Repurposing implicit physiological frequency prior for robust remote photoplethysmography. arXiv preprint, 2026. doi: 10.48550/arXiv.2604.00534

  9. [9]

    Non-contact, automated cardiac pulse measurements using video imaging and blind source separation

    Ming-Zher Poh, Daniel J. McDuff, and Rosalind W. Picard. Non-contact, automated cardiac pulse measurements using video imaging and blind source separation. Optics Express, 18(10):10762–10774, 2010. doi: 10.1364/OE.18.010762

  10. [10]

    Robust pulse rate from chrominance-based rPPG

    Gerard de Haan and Vincent Jeanne. Robust pulse rate from chrominance-based rPPG. IEEE Transactions on Biomedical Engineering, 60(10):2878–2886, 2013. doi: 10.1109/TBME.2013.2266196

  11. [11]

    Algorithmic principles of remote PPG

    Wenjin Wang, Albertus C. den Brinker, Sander Stuijk, and Gerard de Haan. Algorithmic principles of remote PPG. IEEE Transactions on Biomedical Engineering, 64(7):1479–1491, 2017. doi: 10.1109/TBME.2016.2609282

  12. [12]

    Non-contact heart rate measurement based on facial videos

    Chien-Chih Wang. Non-contact heart rate measurement based on facial videos. Master’s thesis, National Cheng Kung University, No. 1, Dasyue Rd, East District, Tainan City, 701, 2020

  13. [13]

    Assessment of deep learning-based heart rate estimation using remote photoplethysmography under different illuminations

    Ze Yang, Haofei Wang, and Feng Lu. Assessment of deep learning-based heart rate estimation using remote photoplethysmography under different illuminations. IEEE Transactions on Human-Machine Systems, 52(6):1236–1246, 2022. doi: 10.1109/THMS.2022.3207755

  14. [14]

    Joint 3D face reconstruction and dense alignment with position map regression network

    Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. Joint 3D face reconstruction and dense alignment with position map regression network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 534–551, 2018. doi: 10.1007/978-3-030-01264-9_32

  15. [15]

    Comparative analysis of non-end-to-end and end-to-end deep learning models with 2D and 3D face alignment for remote heart rate estimation

    Yu-Chiao Wang. Comparative analysis of non-end-to-end and end-to-end deep learning models with 2D and 3D face alignment for remote heart rate estimation. Master’s thesis, National Cheng Kung University, Tainan, Taiwan, June 2025

  16. [16]

    A plug-and-play temporal normalization module for robust remote photoplethysmography

    Kegang Wang, Jiankai Tang, Yantao Wei, Mingxuan Liu, Xin Liu, and Yuntao Wang. A plug-and-play temporal normalization module for robust remote photoplethysmography. arXiv preprint, 2024. doi: 10.48550/arXiv.2411.15283