Remote Heart Rate Measurement from Highly Compressed Facial Videos: an End-to-end Deep Learning Solution with Video Enhancement
Pith reviewed 2026-05-24 14:41 UTC · model grok-4.3
The pith
A two-stage neural network recovers heart rate signals from heavily compressed face videos by first restoring lost pulse information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a Spatio-Temporal Video Enhancement Network (STVEN) can recover rPPG information lost to video compression and, when jointly trained with an rPPGNet, enables accurate heart-rate extraction from highly compressed facial videos. The rPPGNet alone already gives robust measurements; adding the jointly trained STVEN further improves results especially under strong compression. The same pipeline also generalizes to entirely new datasets that contain only compressed videos.
What carries the argument
The Spatio-Temporal Video Enhancement Network (STVEN) that restores hidden rPPG signals before they reach the rPPG measurement network, trained end-to-end with it.
If this is right
- The rPPGNet component can be used by itself for robust measurement when enhancement is not needed.
- Joint training of the two stages produces the largest gains precisely on the most compressed inputs.
- Performance holds when the system is trained and tested on novel data that supplies only compressed videos.
- The approach therefore opens the door to real-world remote-healthcare uses where video is always compressed.
Where Pith is reading between the lines
- The same recovery idea could be tested on other video-based physiological signals such as respiration rate.
- In deployed systems the method might lower the bandwidth needed for remote monitoring without sacrificing accuracy.
- Live-stream tests with changing compression rates would reveal whether the enhancement step remains stable under variable network conditions.
Load-bearing premise
That the hidden rPPG information lost to compression can be recovered by the STVEN enhancement network when the two stages are jointly trained.
What would settle it
If the jointly trained system fails to beat a plain rPPGNet on a fresh collection of highly compressed videos that have no high-quality reference pairs, the recovery claim would be falsified.
Figures
read the original abstract
Remote photoplethysmography (rPPG), which aims at measuring heart activities without any contact, has great potential in many applications (e.g., remote healthcare). Existing rPPG approaches rely on analyzing very fine details of facial videos, which are prone to be affected by video compression. Here we propose a two-stage, end-to-end method using hidden rPPG information enhancement and attention networks, which is the first attempt to counter video compression loss and recover rPPG signals from highly compressed videos. The method includes two parts: 1) a Spatio-Temporal Video Enhancement Network (STVEN) for video enhancement, and 2) an rPPG network (rPPGNet) for rPPG signal recovery. The rPPGNet can work on its own for robust rPPG measurement, and the STVEN network can be added and jointly trained to further boost the performance especially on highly compressed videos. Comprehensive experiments are performed on two benchmark datasets to show that, 1) the proposed method not only achieves superior performance on compressed videos with high-quality videos pair, 2) it also generalizes well on novel data with only compressed videos available, which implies the promising potential for real world applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage end-to-end deep learning pipeline for remote photoplethysmography (rPPG) heart-rate measurement from highly compressed facial videos. It consists of a Spatio-Temporal Video Enhancement Network (STVEN) that is jointly trained with an rPPGNet to recover subtle color-change signals lost to compression; the rPPGNet can also be used standalone. The central claims are (1) superior performance when paired high-quality/compressed training data are available and (2) good generalization to novel compressed-only test videos on two benchmark datasets, with implications for real-world deployment.
Significance. If the recovery and generalization claims hold under realistic compression mismatch, the work would address a practical barrier in contactless vital-sign monitoring where video streams are routinely compressed. The joint-training architecture and the explicit separation of enhancement and measurement stages are technically interesting strengths; however, the manuscript provides no quantitative results, error bars, dataset statistics, or ablation studies in the supplied abstract, limiting immediate assessment of impact.
major comments (2)
- [Abstract] Abstract: the generalization claim ('generalizes well on novel data with only compressed videos available') is load-bearing for the real-world applicability statement, yet the text supplies no information on how the compression parameters (codec, bitrate, GOP structure) of the training pairs compare to those of the novel test videos. Without such detail or a controlled mismatch experiment, it is impossible to evaluate whether STVEN recovers genuine rPPG components or merely learns dataset-specific artifacts.
- [Abstract] The weakest assumption identified in the stress-test note is not addressed: joint training of STVEN + rPPGNet can recover information destroyed by compression only if the paired training distribution matches the test compressions. The manuscript does not report any cross-compression validation (e.g., training on H.264 and testing on VP9 or different bitrates), which directly undermines the transfer claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the generalization claims. We address the major comments point-by-point below and will revise the manuscript to improve clarity on experimental details.
read point-by-point responses
-
Referee: [Abstract] Abstract: the generalization claim ('generalizes well on novel data with only compressed videos available') is load-bearing for the real-world applicability statement, yet the text supplies no information on how the compression parameters (codec, bitrate, GOP structure) of the training pairs compare to those of the novel test videos. Without such detail or a controlled mismatch experiment, it is impossible to evaluate whether STVEN recovers genuine rPPG components or merely learns dataset-specific artifacts.
Authors: We agree the abstract should specify the compression parameters. The full manuscript (Sections 3.3 and 4.1) details that all videos use the H.264 codec; training pairs are created by compressing original high-quality videos at bitrates of 200-800 kbps, while test videos use the same codec at held-out bitrates (e.g., 100-300 kbps) and different subjects to simulate novel compressed data. We will update the abstract with a concise statement of these settings and add a summary table of codec/bitrate configurations. revision: yes
-
Referee: [Abstract] The weakest assumption identified in the stress-test note is not addressed: joint training of STVEN + rPPGNet can recover information destroyed by compression only if the paired training distribution matches the test compressions. The manuscript does not report any cross-compression validation (e.g., training on H.264 and testing on VP9 or different bitrates), which directly undermines the transfer claim.
Authors: The reported experiments evaluate generalization across unseen bitrates and subjects within the H.264 codec, which matches common deployment scenarios where the codec remains fixed. No cross-codec tests (H.264 to VP9) or explicit GOP-structure mismatch experiments appear in the manuscript. We will add a limitations paragraph acknowledging this scope and noting it as valuable future work, while retaining the within-codec results as evidence of robustness to bitrate variation. revision: partial
Circularity Check
No circularity: performance claims rest on external benchmark training and evaluation, not self-referential definitions or fitted inputs.
full rationale
The paper describes a two-stage neural architecture (STVEN + rPPGNet) trained end-to-end on paired high-quality/compressed video data from external benchmarks. All reported performance numbers and generalization statements are empirical outcomes of that training process rather than quantities derived by algebraic reduction from the model's own parameters or prior self-citations. No equations, uniqueness theorems, or ansatzes are presented that would make any claimed result tautological with its inputs. The central claim therefore remains falsifiable against held-out data and does not reduce to a self-definition or fitted-input prediction.
Axiom & Free-Parameter Ledger
free parameters (1)
- STVEN and rPPGNet weights
axioms (1)
- domain assumption Deep neural networks can recover rPPG-relevant features from compressed video after spatio-temporal enhancement
Reference graph
Works this paper leans on
-
[1]
F. Bellard, M. Niedermayer, and et al. Ffmpeg. [online]. available: http://ffmpeg.org. 6
-
[2]
S. Chaichulee, M. Villarroel, J. Jorge, C. Arteta, G. Green, K. McCormick, A. Zisserman, and L. Tarassenko. Multi-task convolutional neural network for patient detection and skin segmentation in continuous non-contact vital sign monitor- ing. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on , pages 266–
work page 2017
-
[3]
W. Chen and D. McDuff. Deepphys: Video-based physiolog- ical measurement using convolutional attention networks. In ECCV , 2018. 2, 6, 8
work page 2018
-
[4]
G. de Haan and V . Jeanne. Robust pulse rate from chrominance-based rppg. IEEE Trans. Biomed. Eng. , 60(10):2878–2886, 2013. 1, 2, 4, 6, 7, 8
work page 2013
-
[5]
C. Dong, Y . Deng, C. Change Loy, and X. Tang. Compres- sion artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Com- puter Vision, pages 576–584, 2015. 2, 8
work page 2015
-
[6]
L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo. Deep generative adversarial compression artifact removal. In ICCV , 2017. 3
work page 2017
-
[7]
S. Hanfland and M. Paul. Video format dependency of ppgi signals. In Proceedings of the International Conference on Electrical Engineering, 2016. 1, 2
work page 2016
-
[8]
ITU-T. Rec. h.262 - information technology - generic coding of moving pictures and associated audio information: Video. International Telecommunication Union Telecommunication Standardization Sector (ITU-T), Tech. Rep., 1995. 2
work page 1995
-
[9]
J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision , pages 694–711. Springer,
- [10]
-
[11]
X. Li, I. Alikhani, J. Shi, T. Seppanen, J. Junttila, K. Majamaa-V oltti, M. Tulppo, and G. Zhao. The obf database: A large face video database for remote physio- logical signal measurement and atrial fibrillation detection. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018) , pages 242–249. IEEE, 2018. 5, 6, 7
work page 2018
-
[12]
X. Li, J. Chen, G. Zhao, and M. Pietik ¨ainen. Remote heart rate measurement from face videos under realistic situations. in CVPR, 2014. 1, 2, 8
work page 2014
-
[13]
D. Liu, B. Wen, X. Liu, Z. Wang, and T. S. Huang. When im- age denoising meets high-level vision tasks: A deep learning approach. In IJCAI, 2018. 5
work page 2018
-
[14]
D. McDuff. Deep super resolution for recovering physi- ological information from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition Workshops, pages 1367–1374, 2018. 3
work page 2018
-
[15]
D. J. McDuff, E. B. Blackford, and J. R. Estepp. The impact of video compression on remote cardiac pulse measurement using imaging photoplethysmography. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE Interna- tional Conference on, pages 63–70. IEEE, 2017. 1, 2, 7
work page 2017
-
[16]
X. Niu, H. Han, S. Shan, and X. Chen. Synrhythm: Learning a deep heart rate estimator from general to specific. InICPR,
-
[17]
M.-Z. Poh, D. J. McDuff, and R. W. Picard. Non-contact, automated cardiac pulse measurements using video imaging and blind source separation. Opt. Express, 18(10):10762– 10774, 2010. 1, 2
work page 2010
-
[18]
M.-Z. Poh, D. J. McDuff, and R. W. Picard. Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE Trans. Biomed. Eng. , 58(1):7–11,
-
[19]
N. Ponomarenko, F. Silvestri, K. Egiazarian, M. Carli, J. As- tola, and V . Lukin. On between-coefficient contrast mask- ing of dct basis functions. In Proceedings of the third inter- national workshop on video processing and quality metrics , volume 4, 2007. 9
work page 2007
-
[20]
A. Puri and A. Eleftheriadis. Mpeg-4: An object-based mul- timedia coding standard supporting mobile applications.Mo- bile Networks and Applications, 3(1):5–32, 1998. 2
work page 1998
-
[21]
J. Shi, I. Alikhani, X. Li, Z. Yu, T. Sepp ¨anen, and G. Zhao. Atrial fibrillation detection from face videos by fusing sub- tle variations. IEEE Transactions on Circuits and Systems for Video Technology, DOI 10.1109/TCSVT.2019.2926632,
-
[22]
M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic. A multimodal database for affect recognition and implicit tag- ging. IEEE Transactions on Affective Computing , 3(1):42– 55, 2012. 5, 6
work page 2012
-
[23]
R. Spetl ´ık, J. Cech, and J. Matas. Non-contact reflectance photoplethysmography: Progress, limitations, and myths. In Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on , pages 702–709. IEEE, 2018. 2, 7 9
work page 2018
-
[24]
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.The Journal of Machine Learning Research, 15(1):1929–1958, 2014. 5
work page 1929
-
[25]
G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand. Overview of the high efficiency video coding (hevc) stan- dard. IEEE Transactions on circuits and systems for video technology, 22(12):1649–1668, 2012. 2
work page 2012
-
[26]
C. Tang, J. Lu, and J. Liu. Non-contact heart rate monitor- ing by combining convolutional neural network skin detec- tion and remote photoplethysmography via a low-cost cam- era. In Proceedings of the IEEE Conference on Computer Vi- sion and Pattern Recognition Workshops, pages 1309–1315,
-
[27]
M. J. Taylor and T. Morris. Adaptive skin segmentation via feature-based face detection. In Real-Time Image and Video Processing 2014, volume 9139, page 91390P. International Society for Optics and Photonics, 2014. 5
work page 2014
-
[28]
D. Tran, H. Wang, L. Torresani, J. Ray, Y . LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition , pages 6450– 6459, 2018. 4
work page 2018
-
[29]
S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. F. Cohn, and N. Sebe. Self-adaptive matrix completion for heart rate estimation from face videos under realistic con- ditions. in CVPR, 2016. 1, 2, 8
work page 2016
-
[30]
W. Verkruysse, L. O. Svaasand, and J. S. Nelson. Remote plethysmographic imaging using ambient light. Opt. Ex- press, 16(26):21434–21445, Dec 2008. 1, 8
work page 2008
-
[31]
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In null, page 511. IEEE, 2001. 6
work page 2001
-
[32]
J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y . Gong. Locality-constrained linear coding for image classification. In 2010 IEEE computer society conference on computer vi- sion and pattern recognition , pages 3360–3367. Citeseer,
work page 2010
-
[33]
W. Wang, A. C. den Brinker, S. Stuijk, and G. de Haan. Al- gorithmic principles of remote ppg. IEEE Transactions on Biomedical Engineering, 64(7):1479–1491, 2017. 2, 4, 6, 7
work page 2017
-
[34]
W. Wang, S. Stuijk, and G. de Haan. A novel algorithm for remote photoplethysmography: Spatial subspace rota- tion. IEEE Trans. Biomed. Eng. , 63(9):1974–1984, 2016. 2
work page 1974
-
[35]
T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the h. 264/avc video coding standard. IEEE Transactions on circuits and systems for video technology , 13(7):560–576, 2003. 2
work page 2003
-
[36]
R. Yang, M. Xu, Z. Wang, and T. Li. Multi-frame quality enhancement for compressed video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion, pages 6664–6673, 2018. 3
work page 2018
- [37]
-
[38]
C. Zhao, C.-L. Lin, W. Chen, and Z. Li. A novel framework for remote photoplethysmography pulse extraction on com- pressed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1299–1308, 2018. 3
work page 2018
-
[39]
J.-Y . Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image- to-image translation using cycle-consistent adversarial net- workss. In Computer Vision (ICCV), 2017 IEEE Interna- tional Conference on, 2017. 4
work page 2017
- [40]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.