pith. sign in

arxiv: 1907.11921 · v1 · pith:43KETZDInew · submitted 2019-07-27 · 📡 eess.IV · cs.CV

Remote Heart Rate Measurement from Highly Compressed Facial Videos: an End-to-end Deep Learning Solution with Video Enhancement

Pith reviewed 2026-05-24 14:41 UTC · model grok-4.3

classification 📡 eess.IV cs.CV
keywords remote photoplethysmographyrPPGvideo compressionheart rate measurementdeep learningvideo enhancementfacial video analysisend-to-end network
0
0 comments X

The pith

A two-stage neural network recovers heart rate signals from heavily compressed face videos by first restoring lost pulse information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a two-stage end-to-end deep learning method to measure heart rate remotely from facial videos that suffer heavy compression. It pairs a Spatio-Temporal Video Enhancement Network that restores hidden rPPG details with a dedicated rPPG measurement network, and the stages can be trained jointly. Experiments on benchmark datasets show the combined system outperforms prior approaches on compressed inputs and continues to work when only compressed videos are available for training. A sympathetic reader would care because everyday video transmission always applies compression, so a method that tolerates it could bring contactless heart monitoring into routine remote-healthcare settings.

Core claim

The central claim is that a Spatio-Temporal Video Enhancement Network (STVEN) can recover rPPG information lost to video compression and, when jointly trained with an rPPGNet, enables accurate heart-rate extraction from highly compressed facial videos. The rPPGNet alone already gives robust measurements; adding the jointly trained STVEN further improves results especially under strong compression. The same pipeline also generalizes to entirely new datasets that contain only compressed videos.

What carries the argument

The Spatio-Temporal Video Enhancement Network (STVEN) that restores hidden rPPG signals before they reach the rPPG measurement network, trained end-to-end with it.

If this is right

  • The rPPGNet component can be used by itself for robust measurement when enhancement is not needed.
  • Joint training of the two stages produces the largest gains precisely on the most compressed inputs.
  • Performance holds when the system is trained and tested on novel data that supplies only compressed videos.
  • The approach therefore opens the door to real-world remote-healthcare uses where video is always compressed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same recovery idea could be tested on other video-based physiological signals such as respiration rate.
  • In deployed systems the method might lower the bandwidth needed for remote monitoring without sacrificing accuracy.
  • Live-stream tests with changing compression rates would reveal whether the enhancement step remains stable under variable network conditions.

Load-bearing premise

That the hidden rPPG information lost to compression can be recovered by the STVEN enhancement network when the two stages are jointly trained.

What would settle it

If the jointly trained system fails to beat a plain rPPGNet on a fresh collection of highly compressed videos that have no high-quality reference pairs, the recovery claim would be falsified.

Figures

Figures reproduced from arXiv: 1907.11921 by Guoying Zhao, Wei Peng, Xiaobai Li, Xiaopeng Hong, Zitong Yu.

Figure 1
Figure 1. Figure 1: rPPG measurement from highly compressed videos. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of the overall framework. There are two models in our framework: video quality enhancement model STVEN (left) [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the skin-based attention module of the [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Partition constraints with N = 4. by spatial and channel-wise convolutions with residual con￾nections. As there is no ground truth skin map in related rPPG datasets, we generate the binary labels for each frame by adaptive skin segmentation algorithms [27]. With these binary skin labels, the skin segmentation branch is able to predict high quality skin maps S ∈ R T ×H×W . Here we adopt binary cross entropy… view at source ↗
Figure 6
Figure 6. Figure 6: Performance of video quality enhancement networks. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 5
Figure 5. Figure 5: HR measurement on OBF videos at different bitrates: [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of model output images. (a) face image in [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Predicted rPPG signals (top) and corresponding video [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
read the original abstract

Remote photoplethysmography (rPPG), which aims at measuring heart activities without any contact, has great potential in many applications (e.g., remote healthcare). Existing rPPG approaches rely on analyzing very fine details of facial videos, which are prone to be affected by video compression. Here we propose a two-stage, end-to-end method using hidden rPPG information enhancement and attention networks, which is the first attempt to counter video compression loss and recover rPPG signals from highly compressed videos. The method includes two parts: 1) a Spatio-Temporal Video Enhancement Network (STVEN) for video enhancement, and 2) an rPPG network (rPPGNet) for rPPG signal recovery. The rPPGNet can work on its own for robust rPPG measurement, and the STVEN network can be added and jointly trained to further boost the performance especially on highly compressed videos. Comprehensive experiments are performed on two benchmark datasets to show that, 1) the proposed method not only achieves superior performance on compressed videos with high-quality videos pair, 2) it also generalizes well on novel data with only compressed videos available, which implies the promising potential for real world applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes a two-stage end-to-end deep learning pipeline for remote photoplethysmography (rPPG) heart-rate measurement from highly compressed facial videos. It consists of a Spatio-Temporal Video Enhancement Network (STVEN) that is jointly trained with an rPPGNet to recover subtle color-change signals lost to compression; the rPPGNet can also be used standalone. The central claims are (1) superior performance when paired high-quality/compressed training data are available and (2) good generalization to novel compressed-only test videos on two benchmark datasets, with implications for real-world deployment.

Significance. If the recovery and generalization claims hold under realistic compression mismatch, the work would address a practical barrier in contactless vital-sign monitoring where video streams are routinely compressed. The joint-training architecture and the explicit separation of enhancement and measurement stages are technically interesting strengths; however, the manuscript provides no quantitative results, error bars, dataset statistics, or ablation studies in the supplied abstract, limiting immediate assessment of impact.

major comments (2)
  1. [Abstract] Abstract: the generalization claim ('generalizes well on novel data with only compressed videos available') is load-bearing for the real-world applicability statement, yet the text supplies no information on how the compression parameters (codec, bitrate, GOP structure) of the training pairs compare to those of the novel test videos. Without such detail or a controlled mismatch experiment, it is impossible to evaluate whether STVEN recovers genuine rPPG components or merely learns dataset-specific artifacts.
  2. [Abstract] The weakest assumption identified in the stress-test note is not addressed: joint training of STVEN + rPPGNet can recover information destroyed by compression only if the paired training distribution matches the test compressions. The manuscript does not report any cross-compression validation (e.g., training on H.264 and testing on VP9 or different bitrates), which directly undermines the transfer claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the generalization claims. We address the major comments point-by-point below and will revise the manuscript to improve clarity on experimental details.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the generalization claim ('generalizes well on novel data with only compressed videos available') is load-bearing for the real-world applicability statement, yet the text supplies no information on how the compression parameters (codec, bitrate, GOP structure) of the training pairs compare to those of the novel test videos. Without such detail or a controlled mismatch experiment, it is impossible to evaluate whether STVEN recovers genuine rPPG components or merely learns dataset-specific artifacts.

    Authors: We agree the abstract should specify the compression parameters. The full manuscript (Sections 3.3 and 4.1) details that all videos use the H.264 codec; training pairs are created by compressing original high-quality videos at bitrates of 200-800 kbps, while test videos use the same codec at held-out bitrates (e.g., 100-300 kbps) and different subjects to simulate novel compressed data. We will update the abstract with a concise statement of these settings and add a summary table of codec/bitrate configurations. revision: yes

  2. Referee: [Abstract] The weakest assumption identified in the stress-test note is not addressed: joint training of STVEN + rPPGNet can recover information destroyed by compression only if the paired training distribution matches the test compressions. The manuscript does not report any cross-compression validation (e.g., training on H.264 and testing on VP9 or different bitrates), which directly undermines the transfer claim.

    Authors: The reported experiments evaluate generalization across unseen bitrates and subjects within the H.264 codec, which matches common deployment scenarios where the codec remains fixed. No cross-codec tests (H.264 to VP9) or explicit GOP-structure mismatch experiments appear in the manuscript. We will add a limitations paragraph acknowledging this scope and noting it as valuable future work, while retaining the within-codec results as evidence of robustness to bitrate variation. revision: partial

Circularity Check

0 steps flagged

No circularity: performance claims rest on external benchmark training and evaluation, not self-referential definitions or fitted inputs.

full rationale

The paper describes a two-stage neural architecture (STVEN + rPPGNet) trained end-to-end on paired high-quality/compressed video data from external benchmarks. All reported performance numbers and generalization statements are empirical outcomes of that training process rather than quantities derived by algebraic reduction from the model's own parameters or prior self-citations. No equations, uniqueness theorems, or ansatzes are presented that would make any claimed result tautological with its inputs. The central claim therefore remains falsifiable against held-out data and does not reduce to a self-definition or fitted-input prediction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard supervised deep learning assumptions plus the existence of paired high-quality and compressed training videos; the networks themselves introduce large numbers of fitted parameters.

free parameters (1)
  • STVEN and rPPGNet weights
    Millions of parameters in the convolutional and attention layers are fitted during end-to-end training on the benchmark datasets.
axioms (1)
  • domain assumption Deep neural networks can recover rPPG-relevant features from compressed video after spatio-temporal enhancement
    This is the core premise that justifies adding the STVEN stage before rPPGNet.

pith-pipeline@v0.9.0 · 5769 in / 1298 out tokens · 32809 ms · 2026-05-24T14:41:49.633351+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

  1. [1]

    Bellard, M

    F. Bellard, M. Niedermayer, and et al. Ffmpeg. [online]. available: http://ffmpeg.org. 6

  2. [2]

    Chaichulee, M

    S. Chaichulee, M. Villarroel, J. Jorge, C. Arteta, G. Green, K. McCormick, A. Zisserman, and L. Tarassenko. Multi-task convolutional neural network for patient detection and skin segmentation in continuous non-contact vital sign monitor- ing. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on , pages 266–

  3. [3]

    Chen and D

    W. Chen and D. McDuff. Deepphys: Video-based physiolog- ical measurement using convolutional attention networks. In ECCV , 2018. 2, 6, 8

  4. [4]

    de Haan and V

    G. de Haan and V . Jeanne. Robust pulse rate from chrominance-based rppg. IEEE Trans. Biomed. Eng. , 60(10):2878–2886, 2013. 1, 2, 4, 6, 7, 8

  5. [5]

    C. Dong, Y . Deng, C. Change Loy, and X. Tang. Compres- sion artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Com- puter Vision, pages 576–584, 2015. 2, 8

  6. [6]

    Galteri, L

    L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo. Deep generative adversarial compression artifact removal. In ICCV , 2017. 3

  7. [7]

    Hanfland and M

    S. Hanfland and M. Paul. Video format dependency of ppgi signals. In Proceedings of the International Conference on Electrical Engineering, 2016. 1, 2

  8. [8]

    ITU-T. Rec. h.262 - information technology - generic coding of moving pictures and associated audio information: Video. International Telecommunication Union Telecommunication Standardization Sector (ITU-T), Tech. Rep., 1995. 2

  9. [9]

    Johnson, A

    J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision , pages 694–711. Springer,

  10. [10]

    Lam and Y

    A. Lam and Y . Kuno. Robust heart rate measurement from video using select random patches. In Proceedings of the IEEE International Conference on Computer Vision , pages 3640–3648, 2015. 2

  11. [11]

    X. Li, I. Alikhani, J. Shi, T. Seppanen, J. Junttila, K. Majamaa-V oltti, M. Tulppo, and G. Zhao. The obf database: A large face video database for remote physio- logical signal measurement and atrial fibrillation detection. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018) , pages 242–249. IEEE, 2018. 5, 6, 7

  12. [12]

    X. Li, J. Chen, G. Zhao, and M. Pietik ¨ainen. Remote heart rate measurement from face videos under realistic situations. in CVPR, 2014. 1, 2, 8

  13. [13]

    D. Liu, B. Wen, X. Liu, Z. Wang, and T. S. Huang. When im- age denoising meets high-level vision tasks: A deep learning approach. In IJCAI, 2018. 5

  14. [14]

    D. McDuff. Deep super resolution for recovering physi- ological information from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition Workshops, pages 1367–1374, 2018. 3

  15. [15]

    D. J. McDuff, E. B. Blackford, and J. R. Estepp. The impact of video compression on remote cardiac pulse measurement using imaging photoplethysmography. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE Interna- tional Conference on, pages 63–70. IEEE, 2017. 1, 2, 7

  16. [16]

    X. Niu, H. Han, S. Shan, and X. Chen. Synrhythm: Learning a deep heart rate estimator from general to specific. InICPR,

  17. [17]

    M.-Z. Poh, D. J. McDuff, and R. W. Picard. Non-contact, automated cardiac pulse measurements using video imaging and blind source separation. Opt. Express, 18(10):10762– 10774, 2010. 1, 2

  18. [18]

    M.-Z. Poh, D. J. McDuff, and R. W. Picard. Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE Trans. Biomed. Eng. , 58(1):7–11,

  19. [19]

    Ponomarenko, F

    N. Ponomarenko, F. Silvestri, K. Egiazarian, M. Carli, J. As- tola, and V . Lukin. On between-coefficient contrast mask- ing of dct basis functions. In Proceedings of the third inter- national workshop on video processing and quality metrics , volume 4, 2007. 9

  20. [20]

    Puri and A

    A. Puri and A. Eleftheriadis. Mpeg-4: An object-based mul- timedia coding standard supporting mobile applications.Mo- bile Networks and Applications, 3(1):5–32, 1998. 2

  21. [21]

    J. Shi, I. Alikhani, X. Li, Z. Yu, T. Sepp ¨anen, and G. Zhao. Atrial fibrillation detection from face videos by fusing sub- tle variations. IEEE Transactions on Circuits and Systems for Video Technology, DOI 10.1109/TCSVT.2019.2926632,

  22. [22]

    Soleymani, J

    M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic. A multimodal database for affect recognition and implicit tag- ging. IEEE Transactions on Affective Computing , 3(1):42– 55, 2012. 5, 6

  23. [23]

    Spetl ´ık, J

    R. Spetl ´ık, J. Cech, and J. Matas. Non-contact reflectance photoplethysmography: Progress, limitations, and myths. In Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on , pages 702–709. IEEE, 2018. 2, 7 9

  24. [24]

    Srivastava, G

    N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.The Journal of Machine Learning Research, 15(1):1929–1958, 2014. 5

  25. [25]

    G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand. Overview of the high efficiency video coding (hevc) stan- dard. IEEE Transactions on circuits and systems for video technology, 22(12):1649–1668, 2012. 2

  26. [26]

    C. Tang, J. Lu, and J. Liu. Non-contact heart rate monitor- ing by combining convolutional neural network skin detec- tion and remote photoplethysmography via a low-cost cam- era. In Proceedings of the IEEE Conference on Computer Vi- sion and Pattern Recognition Workshops, pages 1309–1315,

  27. [27]

    M. J. Taylor and T. Morris. Adaptive skin segmentation via feature-based face detection. In Real-Time Image and Video Processing 2014, volume 9139, page 91390P. International Society for Optics and Photonics, 2014. 5

  28. [28]

    D. Tran, H. Wang, L. Torresani, J. Ray, Y . LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition , pages 6450– 6459, 2018. 4

  29. [29]

    Tulyakov, X

    S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. F. Cohn, and N. Sebe. Self-adaptive matrix completion for heart rate estimation from face videos under realistic con- ditions. in CVPR, 2016. 1, 2, 8

  30. [30]

    Verkruysse, L

    W. Verkruysse, L. O. Svaasand, and J. S. Nelson. Remote plethysmographic imaging using ambient light. Opt. Ex- press, 16(26):21434–21445, Dec 2008. 1, 8

  31. [31]

    Viola and M

    P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In null, page 511. IEEE, 2001. 6

  32. [32]

    J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y . Gong. Locality-constrained linear coding for image classification. In 2010 IEEE computer society conference on computer vi- sion and pattern recognition , pages 3360–3367. Citeseer,

  33. [33]

    W. Wang, A. C. den Brinker, S. Stuijk, and G. de Haan. Al- gorithmic principles of remote ppg. IEEE Transactions on Biomedical Engineering, 64(7):1479–1491, 2017. 2, 4, 6, 7

  34. [34]

    W. Wang, S. Stuijk, and G. de Haan. A novel algorithm for remote photoplethysmography: Spatial subspace rota- tion. IEEE Trans. Biomed. Eng. , 63(9):1974–1984, 2016. 2

  35. [35]

    Wiegand, G

    T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the h. 264/avc video coding standard. IEEE Transactions on circuits and systems for video technology , 13(7):560–576, 2003. 2

  36. [36]

    R. Yang, M. Xu, Z. Wang, and T. Li. Multi-frame quality enhancement for compressed video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion, pages 6664–6673, 2018. 3

  37. [37]

    Zhang, W

    K. Zhang, W. Zuo, Y . Chen, D. Meng, and L. Zhang. Be- yond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017. 3, 8

  38. [38]

    Zhao, C.-L

    C. Zhao, C.-L. Lin, W. Chen, and Z. Li. A novel framework for remote photoplethysmography pulse extraction on com- pressed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1299–1308, 2018. 3

  39. [39]

    J.-Y . Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image- to-image translation using cycle-consistent adversarial net- workss. In Computer Vision (ICCV), 2017 IEEE Interna- tional Conference on, 2017. 4

  40. [40]

    petlk, V

    R. petlk, V . Franc, and J. Matas. Visual heart rate estimation with convolutional neural network. In BMVC, 2018. 2, 6, 8 10